No-SQL (Cassandra) data modelling of user data - cassandra

How do you model user data in Cassandra?
A single table for user data, partitioned by user-ID, with different components reading/writing to different columns?
Multiple tables (one per component) with the same key structure, that occasionally need to be "joined" together on partition key?
We have various data and metadata associated with a customer, that we currently hold in separate tables with the same partitioning & clustering keys.
This leads to fetching bits of information for a user from different tables (e.g. for analytics), effectively "joining" two or more Cassandra tables on their partition keys.
On the positive side, inserting to tables is done independently.
Is there a race condition when concurrently updating data under the same partition key but in different columns? Or are the deltas gracefully merged in the SSTables?
Is having multiple tables with the same partition (and clustering) keys usual or an anti-pattern?
To make this more concrete, let's say we have:
CREATE TABLE example (
pk text PRIMARY KEY,
col_a text,
col_b text
);
Assume that for a given partition key (pk), both col_a and col_b initially have some value (i.e. not null), and two concurrent inserts each update one of them. Is any race condition possible there, i.e. could one of the two updates be lost despite writing to separate columns?

Summary
Write conflicts are something you shouldn't need to worry about. All INSERTs, UPDATEs and DELETEs in Cassandra are effectively upserts, and everything is tracked per column.
Cassandra uses a last-write-wins strategy to resolve conflicts. As you can see in the example below, whenever you change a value, the timestamp associated with that column is updated. Since one of your concurrent updates writes col_a and the other writes col_b, they touch different cells and neither overwrites the other.
Example
Initial Insert
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a, col_b ) VALUES ( '1', 'deckard', 'Blade Runner');
cqlsh:test_keyspace> select * from race_condition_test ;
pk | col_a | col_b
----+---------+--------------
1 | deckard | Blade Runner
(1 rows)
Timestamps are the same in the initial insert
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+--------------+------------------
1 | deckard | 1526916970412357 | Blade Runner | 1526916970412357
(1 rows)
Once col_b is updated, its timestamp changes to reflect the change.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_b ) VALUES ( '1', 'Rick');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+-------+------------------
1 | deckard | 1526916970412357 | Rick | 1526917272641682
(1 rows)
After col_a is updated, it too gets its timestamp updated to the new value.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a) VALUES ( '1', 'bounty hunter');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------------+------------------+-------+------------------
1 | bounty hunter | 1526917323082217 | Rick | 1526917272641682
(1 rows)
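To make the per-column behaviour above a bit more tangible, here is a small Python sketch of a last-write-wins merge done column by column. It is only an illustration of the idea, not Cassandra's actual code: the names Cell, Row and merge_rows are made up for this example, and ties between equal timestamps are ignored for simplicity.

```python
from typing import Dict, Tuple

# A cell is (value, write_timestamp_in_microseconds); a row maps column name -> cell.
# These type names are hypothetical and exist only for this sketch.
Cell = Tuple[str, int]
Row = Dict[str, Cell]


def merge_rows(a: Row, b: Row) -> Row:
    """Merge two versions of a row column by column, keeping the newest cell."""
    merged: Row = dict(a)
    for column, cell in b.items():
        current = merged.get(column)
        # Last write wins, decided per column by comparing timestamps.
        if current is None or cell[1] > current[1]:
            merged[column] = cell
    return merged


initial = {"col_a": ("deckard", 1526916970412357),
           "col_b": ("Blade Runner", 1526916970412357)}
update_b = {"col_b": ("Rick", 1526917272641682)}
update_a = {"col_a": ("bounty hunter", 1526917323082217)}

# Apply the two single-column updates in either order.
result_1 = merge_rows(merge_rows(initial, update_b), update_a)
result_2 = merge_rows(merge_rows(initial, update_a), update_b)
```

Both orders end up with col_a = 'bounty hunter' and col_b = 'Rick', because each update only carries the column it wrote.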
Recommendation
My recommendation is that you use one single table that serves your query needs. If you need to query by pk, then create one single table with all columns you need. This way you will have a single wide row that can be read back efficiently, as part of a single query.
The data model you describe in option 2 is a bit too relational, and is not optimal for Cassandra. You cannot perform joins natively in Cassandra, and you should avoid performing joins on the client side.
Data Model Rules:
Rule 1: Spread data evenly across the cluster
You will want to create a partition key that will ensure the data is evenly distributed across the cluster and you don't have any hotspots.
Rule 2: Minimize the number of partitions read
Each partition may reside on a different node, so for performance's sake you should try to design your queries so that they ideally go to only one node.
Rule 3: Model around your queries
Determine what queries to support
Create a table that satisfies your query (meaning that you should use one table per query pattern).
If you need to support more query patterns, then denormalize your data into additional tables that serve those queries. Avoid secondary indexes and materialized views, as they are not stable at the time of writing, and secondary indexes in particular can create major performance issues as your cluster grows.
If you want to read a little more about this, I suggest this DataStax page:
Basic Rules of Cassandra Data Modeling

Related

Retrieving bucketing value in WITH statement for subsequent SELECT

I have several tables with bucketing applied. It can work great when I specify the bucket/partition parameter upfront in my SELECT query, however when I retrieve the bucket value I need from a different table - within a WITH select statement, Hive/Athena seems to no longer use the optimisation, and searches the entire database instead. I would like to learn if there is a way to write my query properly to maintain the optimisation.
For a simple example, I have two tables:
Table1
category | categoryid
---------+-----------
mass | 1
Table2
categoryid | index | value
-----------+-------+------
1 | 0 | 15
1 | 1 | 10
1 | 2 | 7
The bucketed/clustered column is categoryid. I have a single category ('mass') and would like to retrieve the values that correspond to that category. So I have designed my SELECT like this:
WITH dataset AS (
SELECT categoryid
FROM Table1
WHERE category='mass'
)
SELECT index,value
FROM Table2, dataset
WHERE Table2.categoryid=dataset.categoryid
This will run, but it seems to search the entire database, presumably because Hive doesn't know the categoryid for bucketing before commencing the search. If I swap out the final Table2.categoryid=dataset.categoryid for Table2.categoryid=1, then it searches only a fraction of the database.
So is there some way of writing this query to ensure Hive doesn't search more buckets in the second table than it has to?
Athena is based on Presto. Unless there is some modification in Athena in this area (and I think there currently isn't), this cannot be made to work in single query.
Recommended workaround: issue one query to gather the dataset.categoryid values, then pass them as constants to your main query:
WITH dataset AS (
SELECT categoryid
FROM Table1
WHERE category='mass'
)
SELECT index,value
FROM Table2, dataset
WHERE Table2.categoryid = dataset.categoryid
AND Table2.categoryid IN ( <all possible values> );
This is going to be improved with the addition of Dynamic Filtering in Presto, which the Presto community is currently working on.
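As a rough sketch of the two-step workaround, the client can run the lookup query first and then render the main query with the returned ids inlined as constants. The helper name build_main_query is hypothetical, and the sketch assumes the ids are plain integers; how you actually execute the two queries depends on your client library and is left out here.

```python
def build_main_query(category_ids):
    """Render the main query with the bucket values inlined as constants."""
    if not category_ids:
        raise ValueError("no category ids to filter on")
    # int() guards against anything that is not a plain integer id.
    constants = ", ".join(str(int(cid)) for cid in category_ids)
    return (
        "SELECT index, value FROM Table2 "
        f"WHERE categoryid IN ({constants})"
    )


# Step 1 (not shown): run the lookup on Table1 and collect the categoryid values.
# Step 2: build the main query from those values, e.g. for the single id 1.
query = build_main_query([1])
```

Because the bucket values are now literals in the query text, the planner can see them before it starts reading Table2.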

Cassandra returns Unordered result set for numeric values

I am new to NoSQL and have just started learning Cassandra, and I have the following question. I created a simple table with one column to understand Cassandra partitioning and clustering, and I am trying to query all the values after insertion.
My table structure
create table if not exists music_library(custno int, primary key(custno))
I inserted the following values in sequential order
insert into music_library(custno) values (11)
insert into music_library(custno) values (12)
insert into music_library(custno) values (13)
insert into music_library(custno) values (14)
Then I queried this table
select * from music_library
It returns values in the following order
13
11
14
12
but I was expecting
11
12
13
14
Why is it behaving like that?
I ran your exact statements and produced the same result. But I also adjusted your query to run the token function, and this is what it produced:
aaron#cqlsh:stackoverflow> select custno,token(custno) from music_library;
custno | system.token(custno)
--------+----------------------
13 | -5034495173465742853
11 | -4156302194539278891
14 | 4279681877540623768
12 | 8582886034424406875
(4 rows)
Why is it behaving like that?
Simply put, because Cassandra cannot order results by the values of the partition keys.
As your table has a single primary key of custno, your rows are partitioned by the hashed token value of custno, and written to the nodes responsible for those token ranges. When you run an unbound query in Cassandra (query without a WHERE clause), the results are returned ordered by the hashed token values of their partition keys.
Using ORDER BY won't work here, either. ORDER BY can only sort data within a partition, and even then only on clustering keys. To get the custno values to order properly, you will need to find a new partition key, and then specify custno as a clustering key in an ascending direction.
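To illustrate why an unbound scan comes back in that order, here is a minimal Python sketch that sorts the same custno values by a hash of the key instead of by the key itself. The function toy_token is a made-up stand-in: Cassandra's default partitioner uses Murmur3, while this sketch uses MD5 only because it ships with the standard library, so the tokens will not match the ones shown above.

```python
import hashlib


def toy_token(partition_key: int) -> int:
    """Stand-in for the partitioner: hash the key to a signed 64-bit integer.

    Assumption for illustration only: MD5 replaces Murmur3 here, so the
    resulting tokens differ from what Cassandra would compute.
    """
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


custnos = [11, 12, 13, 14]
# An unbound scan walks the token ring, so rows come back in token order,
# which has no relationship to the numeric order of the keys.
scan_order = sorted(custnos, key=toy_token)
```

The point is only that scan_order is decided by the hash values; whether it happens to match numeric order is a coincidence of the hash function.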
Edit 20190916 - follow-up clarifications
Will this tokenization happen for all the columns?
No. The partition keys are hashed into a token to determine their placement in the cluster (which node(s) they are written to). Individual column values are written within a partition.
How can I return the inserted numbers in order?
You cannot alter the order of this table without changing the model. Simply put, you'll have to find a way to organize the values you expect to return (with your query) together (find another partition key). Exactly how that looks depends on your business/query requirements.
For example, let's say that I wanted to track which customers purchased specific music albums. I might create a table that looks like this:
CREATE TABLE customers_by_album (
album TEXT,
band TEXT,
custno INT,
PRIMARY KEY (album,custno))
WITH CLUSTERING ORDER BY (custno ASC);
After inserting some data, the following query returns results ordered by custno:
aaron#cqlsh:stackoverflow> SELECT album,token(album),band,custno FROM
customers_by_album WHERE album='Moving Pictures';
album | system.token(album) | band | custno
-----------------+---------------------+------+--------
Moving Pictures | 7819329704333693835 | Rush | 11
Moving Pictures | 7819329704333693835 | Rush | 12
Moving Pictures | 7819329704333693835 | Rush | 13
Moving Pictures | 7819329704333693835 | Rush | 14
(4 rows)
This works, because I am querying data by a partition (album), and then I am "clustering" on custno which leverages the on-disk sort order. This is also the order the data was written to disk in, so Cassandra just reads it from the partition sequentially.
I wrote an article on this topic for DataStax a few years ago, and it's still quite relevant. Give it a read if you get a chance: https://www.datastax.com/dev/blog/we-shall-have-order

Group data and extract average in Cassandra cqlsh

Let's say we have a keyspace named sensors and a table named sensor_per_row.
This table has the following structure:
sensor_id | ts | value
In this case sensor_id is the partition key and ts (the date the record was created) is the clustering key.
select sensor_id, value , TODATE(ts) as day ,ts from sensors.sensor_per_row
The outcome of this select is
sensor_id | value | day | ts
-----------+-------+------------+---------------
Sensor 2 | 52.7 | 2019-01-04 | 1546640464138
Sensor 2 | 52.8 | 2019-01-04 | 1546640564376
Sensor 2 | 52.9 | 2019-01-04 | 1546640664617
How can I group the data by ts (more specifically, by date) and return the daily average value for each sensor using cqlsh? For instance:
sensor_id | system.avg(value) | day
-----------+-------------------+------------
Sensor 2 | 52.52059 | 2018-12-11
Sensor 2 | 42.52059 | 2018-12-10
Sensor 3 | 32.52059 | 2018-12-11
One way, I guess, is to use a UDF (user-defined function), but such a function runs on only one row at a time. Is it possible to select data inside a UDF?
Another way is to use Java or similar, with multiple queries for each day, or to process the data in some other contact point such as a REST web service, but I don't know about the efficiency of that. Any suggestions?
NoSQL Limitations
While working with NoSQL, we generally have to give up:
Some ACID guarantees.
Consistency from CAP.
Shuffling operations: JOIN, GROUP BY.
You can perform the above operations on the client side by reading the rows from the table and aggregating them yourself.
You can also refer to the answer MAX(), DISTINCT and group by in Cassandra
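As a minimal sketch of that client-side approach, the following Python groups rows by sensor and day and averages the values. It assumes the rows have already been fetched as (sensor_id, ts in milliseconds, value) tuples and that days are cut in UTC; the function name daily_averages is made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_averages(rows):
    """Group (sensor_id, ts_millis, value) rows by sensor and UTC day, then average."""
    sums = defaultdict(lambda: [0.0, 0])  # (sensor_id, day) -> [total, count]
    for sensor_id, ts_millis, value in rows:
        day = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).date()
        bucket = sums[(sensor_id, day.isoformat())]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# The three sample rows from the question.
rows = [
    ("Sensor 2", 1546640464138, 52.7),
    ("Sensor 2", 1546640564376, 52.8),
    ("Sensor 2", 1546640664617, 52.9),
]
averages = daily_averages(rows)
```

For large tables this means pulling every row to the client, so it is best combined with a query restricted to one partition (one sensor) at a time.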
So I found the solution; I will post it in case somebody else has the same question.
From what I read, data modeling seems to be the answer, which means:
In Cassandra we have partition keys and clustering keys. Cassandra handles many simultaneous inserts well, which lets us write the same data to more than one table at the same time. In other words, we can create several tables for the same data-collection application, each playing a role similar to a materialized view in a relational database.
For instance, let's say we have the log schema {sensor_id, region, value}.
The first thing that comes to mind is to create a table called sensor_per_row like:
sensor_id | value | region | ts
-----------+-------+------------+---------------
This is a very efficient way of storing the data for a long time, but given the functions Cassandra offers, it is not that simple to visualize the data or run analytics on it.
Because of that, we can create additional tables with a TTL (time to live), which simply defines how long the data is stored.
For instance, if we want the daily measurements of a specific sensor, we can create a table with day and sensor_id as the partition key and the timestamp as a clustering key in descending order.
If we also add a TTL of 24*60*60 seconds (86,400, i.e. one day), the table holds only the last day's data.
So a table such as sensor_per_day with the above format and TTL will actually give us the daily measurements. Old rows expire as newer measurements arrive, while the full history remains stored in the previous table, sensor_per_row.
I hope that gives you the idea.

Are there any major disadvantages to having multiple clustering columns in cassandra?

I'm designing a Cassandra table where I need to be able to retrieve rows by their geohash. I have something that works, but I'd like to avoid range queries more than I'm currently able to.
The current table schema is this, with geo_key containing the first five characters of the geohash string. I query using the geo_key, then range filter on the full geohash, allowing me to prefix search based on a 5 or greater length geohash:
CREATE TABLE georecords (geo_key text, geohash text, data text, PRIMARY KEY (geo_key, geohash));
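For readers unfamiliar with this kind of prefix search, here is a rough Python sketch of how the partition key and the range bounds could be derived from a geohash prefix of five or more characters. The helper prefix_range is hypothetical and only illustrates the idea described above; it assumes geohash strings are plain ASCII, so bumping the last character gives an exclusive upper bound.

```python
def prefix_range(prefix: str):
    """Return (geo_key, lower, upper) so that lower <= geohash < upper matches the prefix."""
    if len(prefix) < 5:
        raise ValueError("prefix must have at least 5 characters")
    geo_key = prefix[:5]
    # Bump the last character to get the smallest string greater than every
    # string starting with the prefix (geohash characters are plain ASCII).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return geo_key, prefix, upper


geo_key, lower, upper = prefix_range("dp89vcn")
# These values would then be bound to a query of the form:
# SELECT ... FROM georecords WHERE geo_key = ? AND geohash >= ? AND geohash < ?
```

With the prefix 'dp89vcn' this yields geo_key 'dp89v' and the range ['dp89vcn', 'dp89vco').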
My idea is that I could instead store the characters of the geohash as separate columns, allowing me to specify as many characters as I wanted to do a prefix match on the geohash. My concern is what impact using multiple clustering columns might have:
CREATE TABLE georecords (g1 text, g2 text, g3 text, g4 text, g5 text, g6 text, g7 text, g8 text, geohash text, data text, PRIMARY KEY (g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid));
(I'm not really concerned about the cardinality of the partition key - g1 would have minimum 30 values, and I have other workarounds for it as well)
Other than the cardinality of the partition key and the extra storage requirements, what should I be aware of if I use the many-clustering-column approach?
Other than the cardinality of the partition key and the extra storage requirements, what should I be aware of if I use the many-clustering-column approach?
This seemed like an interesting problem to help out with, so I built a few CQL tables of differing PRIMARY KEY structure and options. I then used http://geohash.org/ to come up with a few endpoints, and inserted them.
aploetz#cqlsh:stackoverflow> SELECT g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid, data FROM georecords3;
g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | geohash | pid | data
----+----+----+----+----+----+----+----+--------------+------+---------------
d | p | 8 | 9 | v | c | n | e | dp89vcnem4n | 1001 | Beloit, WI
d | p | 8 | c | p | w | g | v | dp8cpwgv3 | 1003 | Harvard, IL
d | p | c | 8 | g | e | k | t | dpc8gektg8w7 | 1002 | Sheboygan, WI
9 | x | j | 6 | 5 | j | 5 | 1 | 9xj65j518 | 1004 | Denver, CO
(4 rows)
As you know, Cassandra is designed to return data with a specific, precise key. Using multiple clustering columns helps in that approach, in that you are helping Cassandra quickly identify the data you wish to retrieve.
The only thing I would think about changing, is to see if you can do without either geohash or pid in the PRIMARY KEY. My gut says to get rid of pid, as it really isn't anything that you would query by. The only value it provides is that of uniqueness, which you will need if you plan on storing the same geohashes multiple times.
Including pid in your PRIMARY KEY leaves you with one non-key column, and that allows you to use the WITH COMPACT STORAGE directive. Really the only true edge that gets you, is in saving disk space as the clustering column names are not stored with the value. This becomes apparent when looking at the table from within the cassandra-cli tool:
Without compact storage:
[default#stackoverflow] list georecords3;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:, value=, timestamp=1428766191314431)
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:data, value=42656c6f69742c205749, timestamp=1428766191314431)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:, value=, timestamp=1428766191382903)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:data, value=486172766172642c20494c, timestamp=1428766191382903)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:, value=, timestamp=1428766191276179)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:data, value=536865626f7967616e2c205749, timestamp=1428766191276179)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:, value=, timestamp=1428766191424701)
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:data, value=44656e7665722c20434f, timestamp=1428766191424701)
2 Rows Returned.
Elapsed time: 217 msec(s).
With compact storage:
[default#stackoverflow] list georecords2;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001, value=Beloit, WI, timestamp=1428765102994932)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003, value=Harvard, IL, timestamp=1428765717512832)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002, value=Sheboygan, WI, timestamp=1428765102919171)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004, value=Denver, CO, timestamp=1428766022126266)
2 Rows Returned.
Elapsed time: 39 msec(s).
But, I would recommend against using WITH COMPACT STORAGE for the following reasons:
You cannot add or remove columns after table creation.
It prevents you from having multiple non-key columns in the table.
It was really intended to be used in the old (deprecated) thrift-based approach to column family (table) modeling, and really shouldn't be used/needed anymore.
Yes, it saves you disk space, but disk space is cheap so I'd consider this a very small benefit.
I know you said "other than cardinality of the partition key", but I am going to mention it here anyway. You'll notice in my sample data set, that almost all of my rows are stored with the d partition key value. If I were to create an application like this for myself, tracking geohashes in the Wisconsin/Illinois stateline area, I would definitely have the problem of most of my data being stored in the same partition (creating a hotspot in my cluster). So knowing my use case and potential data, I would probably combine the first three or so columns into a single partition key.
The other issue with storing everything in the same partition key is that each partition can store a maximum of about 2 billion columns. So it would also make sense to put some thought into whether or not your data could ever eclipse that mark. And obviously, the higher the cardinality of your partition key, the less likely you are to run into this issue.
By looking at your question, it appears to me that you have looked at your data and you understand this...definite "plus." And 30 unique values in a partition key should provide sufficient distribution. I just wanted to spend some time illustrating how big of a deal that could be.
Anyway, I also wanted to add a "nicely done," as it sounds like you are on the right track.
Edit
The still unresolved question for me is which approach will scale better, in which situations.
Scalability is tied more to how many replicas (R) you have across your N nodes. Cassandra scales linearly: the more nodes you add, the more transactions your application can handle. Purely from a data distribution standpoint, your first model will have a higher-cardinality partition key, so it will distribute much more evenly than the second. However, the first model is much more restrictive in terms of query flexibility.
Additionally, if you are doing range queries within a partition (which I believe you said you are) then the second model will allow for that in a very performant manner. All data within a partition is stored on the same node. So querying multiple results for g1='d' AND g2='p'...etc...will perform extremely well.
I may just have to play with the data more and run test cases.
That is a great idea. I think you will find that the second model is the way to go (in terms of query flexibility and querying for multiple rows). If there is a performance difference between the two when it comes to single row queries, my suspicion is that it should be negligible.
Here's the best Cassandra modeling guide I've found: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
I've used composite columns (6 of them) successfully for very high write/read loads. There is no significant performance penalty when using compact storage (http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create_table_r.html).
Compact storage means the data is stored internally in a single row, with the limitation that you can only have one data column. That seems to suit your application well, regardless of which data model you choose, and would make maximal use of your geo_key filtering.
Another aspect to consider is that the columns are sorted in Cassandra. Having more clustering columns will improve the sorting speed and potentially the lookup.
However, in your case, I'd start with having the geohash as a row key and turn on row cache for fast lookup (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1). If the performance is lacking there, I'd run performance tests on different data representations.

Does CQL3 require a schema for Cassandra now?

I've just had a crash course in Cassandra over the last week and went from the Thrift API to CQL to grokking SuperColumns to learning I shouldn't use them and should use composite keys instead.
I'm now trying out CQL3, and it would appear that I can no longer insert into columns that are not defined in the schema, or see those columns in a select *.
Am I missing some option to enable this in CQL3, or does it expect me to define every column in the schema (defeating the purpose of wide, flexible rows, imho)?
Yes, CQL3 does require columns to be declared before they are used.
But, you can do as many ALTERs as you want, no locking or performance hit is entailed.
That said, most of the places that you'd use "dynamic columns" in earlier C* versions are better served by a Map in C* 1.2.
I suggest you explore composite columns with "WITH COMPACT STORAGE".
A "COMPACT STORAGE" column family allows you to practically only define key columns:
Example:
CREATE TABLE entities_cargo (
entity_id ascii,
item_id ascii,
qt ascii,
PRIMARY KEY (entity_id, item_id)
) WITH COMPACT STORAGE
Actually, when you insert different values for item_id, you don't add a row with entity_id, item_id and qt; you add a column whose name is the item_id content and whose value is the qt content.
So:
insert into entities_cargo (entity_id,item_id,qt) values(100,'oggetto 1',3);
insert into entities_cargo (entity_id,item_id,qt) values(100,'oggetto 2',3);
Now, here is how you see these rows in CQL3:
cqlsh:goh_master> select * from entities_cargo where entity_id = 100;
entity_id | item_id | qt
-----------+-----------+----
100 | oggetto 1 | 3
100 | oggetto 2 | 3
And here is how they look if you check them from the CLI:
[default#goh_master] get entities_cargo[100];
=> (column=oggetto 1, value=3, timestamp=1349853780838000)
=> (column=oggetto 2, value=3, timestamp=1349853784172000)
Returned 2 results.
You can access a single column with
select * from entities_cargo where entity_id = 100 and item_id = 'oggetto 1';
Hope it helps
Cassandra still allows using wide rows. This answer references a DataStax blog entry, written after the question was asked, which details the links between CQL and the underlying architecture.
Legacy support
A dynamic column family defined through Thrift with the following command (notice there is no column-specific metadata):
create column family clicks
with key_validation_class = UTF8Type
and comparator = DateType
and default_validation_class = UTF8Type
Here is the exact equivalent in CQL:
CREATE TABLE clicks (
key text,
column1 timestamp,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
Both of these commands create a wide-row column family that stores records ordered by date.
CQL Extras
In addition, CQL provides the ability to assign labels to the row id, column and value elements to indicate what is being stored. The following, alternative way of defining this same structure in CQL, highlights this feature on DataStax's example - a column family used for storing users' clicks on a website, ordered by time:
CREATE TABLE clicks (
user_id text,
time timestamp,
url text,
PRIMARY KEY (user_id, time)
) WITH COMPACT STORAGE
Notes
a Table in CQL is always mapped to a Column Family in Thrift
the CQL driver uses the first element of the primary key definition as the row key
Composite Columns are used to implement the extra columns that one can define in CQL
using WITH COMPACT STORAGE is not recommended for new designs because it fixes the number of possible columns. In other words, ALTER TABLE ... ADD is not possible on such a table. Just leave it out unless it's absolutely necessary.
interesting, something I didn't know about CQL3. In PlayOrm, the idea is it is a "partial" schema you must define and in the WHERE clause of the select, you can only use stuff that is defined in the partial schema BUT it returns ALL the data of the rows EVEN the data it does not know about....I would expect that CQL should have been doing the same :( I need to look into this now.
thanks,
Dean

Resources