ranking where the score is a function of time - cassandra

I would like to migrate my database, which is currently on MySQL, to C*. There is one table that I have trouble imagining how to "migrate".
Entity
Id
score(s)
hotscore
Where hotscore is f(s,d) = log10 + (s.t/45000). S is score and t is timestamp since epoch.
Essentially, what I want to query is the top 20 of that entity by hotscore. With MySQL, a cron job updates the hotscore every minute. For that reason, hotscore is not suitable as a partition key. I'm trying to see whether I can make this work before moving to C*. As far as I know, a primary key like (id, hotscore) wouldn't be good either, because it means C* has to scan every entry.
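For reference, the ranking being asked for can be sketched in plain Python. This is only an illustration: it assumes the formula is read as log10(s) + t/45000 (one possible reading of the expression above), and `Entity`, `hot_score` and `top_n` are hypothetical names, not part of any schema in the question.

```python
import heapq
import math
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    score: int       # s
    timestamp: int   # t, timestamp since the epoch


def hot_score(score: int, timestamp: int) -> float:
    # Assumed reading of the formula: log10(s) + t / 45000.
    # max(score, 1) guards against log10(0); the guard is an addition.
    return math.log10(max(score, 1)) + timestamp / 45000


def top_n(entities, n=20):
    # Return the n entities with the highest hot score, best first.
    return heapq.nlargest(n, entities, key=lambda e: hot_score(e.score, e.timestamp))


entities = [
    Entity("a", 10, 1_450_000_000),
    Entity("b", 1000, 1_450_000_000),
    Entity("c", 10, 1_450_135_000),
]
ranking = [e.id for e in top_n(entities)]
```

Under this reading, a later timestamp can outweigh a much higher score, which is why "c" ranks ahead of "b" in the sample data.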

You'll soon be able to handle this use case with materialized views when Cassandra 3.0 is released.
See an example of ordering rows in a materialized view here and here.
The way it works is that in your base table you don't use the score as a clustering column, but you do use it as a clustering column in the materialized view. Then, when you update the base table, the ordering in the view is updated automatically.

Related

Cassandra - Shall I have to do so many writes?

I have 5 Tables:
users_by_id
users_by_username
users_by_email
users_by_likes
users_by_followers
I have to write 5 statements every time a user registers. Is that expensive or bad?
INSERT INTO users_by_id (...) values (..)
INSERT INTO users_by_email (...) values (..)
INSERT INTO users_by_username (...) values (..)
INSERT INTO users_by_likes (...) values (..)
INSERT INTO users_by_followers (...) values (..)
The second question: if I update users_by_id, I have to write 5 UPDATE statements. Is there another solution, or is that not so bad?
Cassandra advocates denormalizing your data and building the data model around your queries. You will have to design your data model so that it satisfies all the queries with good performance. Because of its architecture and design, Cassandra performs best when writes and reads go through the partition key.
Writing the same set of data to 5 different tables with 5 inserts is not expensive. Your reads will perform better, and as the data grows to web scale you will be glad you created 5 tables and wrote to all of them.
You can explore materialized views (Materialized View and Datastax Link for Materialized View), but remember that it is an experimental feature, so you have to understand it properly and also identify the open issues with materialized views.
I would recommend studying the Cassandra data model; that will make things easier to grasp.
Cassandra is designed to be a write-intensive database, so do not hesitate to duplicate your data. Always design tables for the read queries. If one table satisfies one query, it is a fine design.
As for your second question: design your tables in such a way that you do not have to update them. Always think in terms of inserting new values.
For example, below table design
CREATE TABLE user_by_email (
email text,
timestamp timestamp,
name text,
fullname text,
userId text,
PRIMARY KEY (email,timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
INSERT INTO user_by_email (email, DateTime.Now ........)
In this design, you get the latest inserted value first. Additionally, this design keeps the change history for that key.
Think about how often we actually have to update values like user id, email, or username: rarely.
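The insert-only idea can be illustrated with a small in-memory model in Python. It is a sketch, not driver code: the `user_by_email` dict stands in for the table above, rows are kept newest first to mimic CLUSTERING ORDER BY (timestamp DESC), and `insert_version` and `latest` are hypothetical helper names.

```python
from collections import defaultdict

# email -> list of (timestamp, columns), kept newest first
user_by_email = defaultdict(list)


def insert_version(email, timestamp, **columns):
    # Every change is a new row; nothing is updated in place.
    rows = user_by_email[email]
    rows.append((timestamp, columns))
    rows.sort(key=lambda row: row[0], reverse=True)


def latest(email):
    # Equivalent in spirit to reading the first row of the partition.
    rows = user_by_email.get(email)
    return rows[0][1] if rows else None


insert_version("bob@example.com", 1, name="bob", fullname="Bob A")
insert_version("bob@example.com", 2, name="bob", fullname="Bob B")
current = latest("bob@example.com")
history_length = len(user_by_email["bob@example.com"])
```

Reading the newest row gives the current state, while the older rows remain available as the change history.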

Cassandra data aggregation and rollup

What is the best way to aggregate data and store it back in a Cassandra cluster? I mean, having a table with hourly data, aggregating it by day and saving the result in a different table. This can be achieved simply with a select and an insert for every key/period, but is there a better or different way? What about materialized views?
Materialized views
The usage of materialized views in Cassandra is quite limited:
all primary key columns from the source table must appear in the view, possibly in a different order
aggregate functions like avg cannot be used
GROUP BY is not allowed
So I do not think they are suitable for your time-based rollup, nor for any other aggregation.
By the way, materialized views have been retroactively classified as experimental and are not recommended for new production use.
Manual solution
This works well as long as the data to aggregate is frozen for good. If it is not, consistency will be hard to handle.
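A minimal sketch of the manual rollup in Python, assuming the hourly rows have already been read into memory as (key, hour_start, value) tuples and that the daily aggregate is a sum. The tuple layout and the `rollup_daily` helper are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def rollup_daily(hourly_rows):
    # hourly_rows: iterable of (key, hour_start, value) tuples standing in
    # for the rows of a hypothetical hourly table.
    totals = defaultdict(float)
    for key, hour_start, value in hourly_rows:
        totals[(key, hour_start.date())] += value
    # Each (key, day) -> total pair would become one INSERT into the daily table.
    return dict(totals)


hourly = [
    ("sensor1", datetime(2016, 3, 24, 0), 1.0),
    ("sensor1", datetime(2016, 3, 24, 1), 2.0),
    ("sensor1", datetime(2016, 3, 25, 0), 4.0),
]
daily = rollup_daily(hourly)
```

Because an INSERT in Cassandra overwrites any existing row with the same primary key, re-running the rollup for a day replaces that day's total rather than duplicating it.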
Indexes
A completely different approach to the rollup would be to use Elassandra to index the temporal column. An Elasticsearch secondary index will be created and kept in sync automatically. Then use the embedded Elasticsearch API to query at different time scales, using a date histogram aggregation.
This way the result of the aggregations is not stored, but calculated in real time from an efficient secondary data structure.

How to maintain data consistency across multiple tables in cassandra?

I'm having trouble figuring out how to maintain attribute updates across multiple tables to ensure data consistency.
For example, suppose I have a many-to-many relationship between actors and fans. A fan can support many actors, and an actor can have many fans. I made several tables to support my queries:
CREATE TABLE fans (
fan_id uuid,
fan_attr_1 int,
fan_attr_2 int,
PRIMARY KEY ((fan_id))
);
CREATE TABLE actors (
actor_id uuid,
actor_attr_1 int,
actor_attr_2 int,
PRIMARY KEY ((actor_id))
);
CREATE TABLE actors_by_fan (
fan_id uuid,
actor_id uuid,
actor_attr_1 int,
actor_attr_2 int,
PRIMARY KEY (fan_id, actor_id)
);
CREATE TABLE fans_by_actor (
actor_id uuid,
fan_id uuid,
fan_attr_1 int,
fan_attr_2 int,
PRIMARY KEY (actor_id, fan_id)
);
Let's say I'm a fan and I'm on my settings page and I want to change my fan_attr_1 to a different value.
On the fans table I can update my attribute just fine since the application knows my fan_id and can key on that.
However I cannot change my fan_attr_1 on the fans_by_actor without first querying for the actor_ids tied to the fan.
This problem occurs for any time you want to update any attribute of either fans or actors.
I've tried looking online for people experiencing similar problems, but I couldn't find any. For example, Datastax's Data Modeling course uses actors and videos in a many-to-many relationship, with tables actors_by_video and videos_by_actor. The course, like the other online resources I've consulted, discusses modeling tables after queries, but doesn't dig into how to maintain data integrity. In the actors_by_video table, what would happen if I wanted to change an actor's attribute? Wouldn't I have to go through every row of actors_by_video to find the partitions that contain the actor and update the attribute? That sounds very inefficient. The other option is to look up the video ids beforehand, but I read elsewhere that reads before writes are an antipattern in Cassandra.
What would be the best approach for tackling this problem either from a data modeling standpoint or from a CQL standpoint?
EDIT:
- Fixed sentence stubs
- Added context and prior research
Data Modeling
Cassandra is not a relational database, and there are certain basic rules that need to be followed in data modeling. At a high level, the data model should pursue the following goals:
1) Spread data evenly around the cluster
2) Minimize the number of partitions read
Moreover, we should go for a single big table rather than breaking it into multiple tables and adding relationships between them. This approach duplicates records, but duplication is not a costly operation: it takes only a little more disk space rather than CPU, memory, disk IOPS, or network.
Please note that there is a size restriction on column key names and values. The maximum column key (and row key) size is 64 KB. The maximum column value size is 2 GB. But because there is no streaming and the whole value is fetched into heap memory when requested, limit the size to only a few MB.
More Info:
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html
CQL
Maintaining consistency across tables can be done using a batch or materialized views. Materialized views are available from version 3.0.
Please see
How to ensure data consistency in Cassandra on different tables?
My preference would be to change the data model, design it around our queries and, if possible, make it a single big table.
Hope it Helps!
Materialized Views are probably the best choice:
CREATE MATERIALIZED VIEW actors_by_fan
AS SELECT fan_id, actor_id, actor_attr_1, actor_attr_2
FROM fans
PRIMARY KEY (fan_id, actor_id);
CREATE MATERIALIZED VIEW fans_by_actor
AS SELECT actor_id, fan_id, fan_attr_1, fan_attr_2
FROM actors
PRIMARY KEY (actor_id, fan_id);
In versions prior to 3.0, create secondary indexes and evaluate whether their performance is acceptable. Later, after upgrading to 3.x, just drop the secondary indexes and create materialized views.
The way you solve this kind of problem is to manually update all the changed records.
Since you can't use materialized views, in order to update fan_attr_1 on your data you need to:
1) Update the fans table by issuing UPDATE fans ... WHERE fan_id = xxx.
2) Select all the actor_ids from actors_by_fan by issuing SELECT actor_id ... WHERE fan_id = xxx.
3) Update all the corresponding rows in the fans_by_actor table by issuing UPDATE fans_by_actor ... WHERE actor_id IN (...), or alternatively loop over the actor_ids and run each update asynchronously.
As long as step 2 returns a small number of actor_ids, say fewer than 20, you can group all the queries and maintain strong consistency between tables by running them in a single BATCH. Otherwise, you need to guarantee consistency between the tables in some other way.
This can be as inefficient as it sounds, but I don't think there are smarter solutions. By the way, you are issuing one read (step 2) and multiple writes (step 1 and step 3). This won't be the end of the world, especially if you don't change attributes very often (e.g. every 10 milliseconds).
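The read-then-fan-out procedure described in this answer can be modeled in Python with plain dictionaries. This is a sketch under stated assumptions: the nested dicts are stand-ins for the CQL tables (outer key = partition key, inner key = clustering key), string ids replace uuids for readability, and `update_fan_attr_1` is a hypothetical helper; no Cassandra driver is involved.

```python
# In-memory stand-ins for the CQL tables (assumption: nested dicts that
# mirror each table's partition key and clustering key).
fans = {
    "f1": {"fan_attr_1": 1, "fan_attr_2": 2},
    "f2": {"fan_attr_1": 7, "fan_attr_2": 8},
}
actors_by_fan = {
    "f1": {"a1": {}, "a2": {}},  # actor attributes omitted for brevity
    "f2": {"a1": {}},
}
fans_by_actor = {
    "a1": {"f1": {"fan_attr_1": 1, "fan_attr_2": 2},
           "f2": {"fan_attr_1": 7, "fan_attr_2": 8}},
    "a2": {"f1": {"fan_attr_1": 1, "fan_attr_2": 2}},
}


def update_fan_attr_1(fan_id, new_value):
    # Write to the base table, keyed directly by fan_id.
    fans[fan_id]["fan_attr_1"] = new_value
    # The single read: which actors does this fan follow?
    actor_ids = sorted(actors_by_fan.get(fan_id, {}))
    # One write per (actor_id, fan_id) row in the denormalized table.
    for actor_id in actor_ids:
        fans_by_actor[actor_id][fan_id]["fan_attr_1"] = new_value
    return actor_ids


touched = update_fan_attr_1("f1", 42)
```

The cost is one read plus one write per followed actor, which is why the size of that actor list determines whether a single BATCH is practical.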

select older versions of data after update in Cassandra

This is my use-case.
I have inserted a row of data in Cassandra with the following query:
INSERT INTO TableWide1 (UID, TimeStampCol, Value, DateCol) VALUES ('id1','2016-03-24 17:54:36',45,'2015-03-24 00:00:00');
I update one row to have a new value.
update TableWide1 set Value = 46 where uid = 'id1' and datecol='2015-03-24 00:00:00' and timestampcol='2016-03-24 17:54:36';
Now, I would like to see all versions of this data from Cassandra. I know in HBase, this is pretty straightforward, but in Cassandra, is this even possible?
I explored a bit using writetime(), but it just gives the latest time of the newly updated data. And it cannot be used in a where clause either.
This is what my schema looks like:
CREATE TABLE TableWide1(
UID varchar,
TimeStampCol timestamp,
Value double,
DateCol timestamp,
PRIMARY KEY ((UID,DateCol), TimeStampCol)
);
So is this technically possible, given the fact the old data still exists in Cassandra?
If your partitions won't get too wide, you could drop the time partitioning:
CREATE TABLE table_wide (
UID varchar,
TimeStampCol timestamp,
Value double,
PRIMARY KEY ((UID), TimeStampCol)
);
That's generally bad, though, since eventually you will hit the limits of a partition.
But really, you had it right. You won't be able to do it in a single statement, but under the covers you can't stream the entire set over at once anyway; it has to be paged through. So you can just iterate through the results one day at a time. If your dataset has days with no data and you don't want to waste reads, you can keep an additional table around to mark which days have data:
CREATE TABLE table_wide_partition_list (
UID varchar,
DateCol timestamp,
PRIMARY KEY (UID, DateCol)
);
And make one query to it first.
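The day-by-day iteration can be sketched in Python as a generator of partition keys. It assumes one partition per (UID, day), as in the original schema, and that the set of days with data has already been read from the extra table. The `day_partitions` helper is hypothetical.

```python
from datetime import date, timedelta


def day_partitions(uid, first_day, last_day, days_with_data=None):
    # Yield (uid, day) partition keys from first_day to last_day inclusive.
    # If days_with_data is given, days that are not in it are skipped.
    day = first_day
    while day <= last_day:
        if days_with_data is None or day in days_with_data:
            yield (uid, day)
        day += timedelta(days=1)


known_days = {date(2015, 3, 24), date(2015, 3, 26)}
keys = list(day_partitions("id1", date(2015, 3, 23), date(2015, 3, 26), known_days))
```

Each yielded key would translate into one query of the form WHERE uid = ? AND datecol = ? against the original table.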
Really, if you want HBase-like behavior for scans, you are probably looking for a more OLAP-style approach rather than normal C* usage. For this it's almost universally recommended to use Spark with Cassandra currently.
Cassandra does not retain old data when it is updated.
The old value is superseded by the newer write and is removed when compaction happens.
HBase was not made for real-time applications or for serving hot data to and from an application server, though things have improved since the early days of HBase.
People use HBase mainly because they already have a Hadoop cluster.
Another noticeable and important difference is that Cassandra is very fast at retrieving single or multiple records by key, but not at range queries like >10 && <10, because data is stored based on the hashed key. HBase, on the other hand, stores data in sorted order and is an ideal candidate for range queries.
Anyway, since Cassandra doesn't retain old data, you cannot retrieve it.

Cassandra or Hbase?

I have a requirement, where I want to store the following:
Mac Address // PKEY
TimeStamp // PKEY
LocationID
ownerName
Signal Strength
The insertion logic is as follows:
Store the above statistics for each active device (MacAddress) once every hour at each location (LocationID)
The entries are created at end of each hour, so the primary key will always be MAC+TimeStamp
There are no updates, only insertions
The queries which can be performed are as follows:
Give me all the entries for last 'N' hours Where MacAddress = "...."
Give me all the entries for last 'N' hours Where LocationID IN (locID1, locID2, ..);
Needless to say, there are billions of entries, and I want to use either HBase or Cassandra. I've tried to explore, and it seems that Cassandra may not be the correct choice.
The reason is that if I have the following in Cassandra:
< < RowKey > MacAddress:TimeStamp > >
+ LocationID
+ OwnerName
+ Signal Strength
Both queries will scan the whole database, right? Even if I add an index on LocationID, that is only going to help the second query to some extent, because there is no index on timestamp. (I believe that searching on timestamp is not fast, as the MacAddress:TimeStamp composite key would not allow us to search only on timestamp, and a full scan would happen instead. Is that correct?)
I'm stuck here big time, and any insights on whether we should opt for HBase or Cassandra would really help.
The right way to model this with Cassandra is to use a table partitioned by mac address, ordered by timestamp, and indexed on location id. See the Cassandra data model documentation, especially the section on clustering [predefined sorting]. None of your queries will require a full table scan.
You have to remember that NoSQL stores like Cassandra allow horizontal scaling and make it a lot easier to shard the data. By developing a sharding strategy (identifying the shard key, etc.) you can dramatically reduce the amount of data on a single instance and make queries doable, even against massive data sets.
Either one would work for this query:
Give me all the entries for last 'N' hours Where MacAddress = "...."
In cassandra you would want to use an ordered partitioner so you can do easy scans. That way you would not have to scan the entire table. (I'm a little rusty on Cassandra).
In hbase it is always ordered by the rowkey so the scan becomes easy. You would just set a start and stop rowkey. Conceptually it would be:
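// Row keys are byte arrays; Bytes is org.apache.hadoop.hbase.util.Bytes
scan.setStartRow(Bytes.toBytes(mac + ":" + timestamp));
scan.setStopRow(Bytes.toBytes(mac + ":" + endtimestamp));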
And then it would only scan over the rows for the given mac address for the given time period--only a small subset of the data.
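Why a start/stop row scan only touches a small slice of the data can be illustrated in Python with a sorted list and binary search. This is a conceptual model, not HBase client code: `range_scan` is a hypothetical helper, and the row keys assume timestamps of fixed width so that string order matches time order.

```python
import bisect


def range_scan(sorted_keys, start_row, stop_row):
    # Start row is inclusive, stop row is exclusive.
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


mac = "00:11:22:33:44:55"
keys = sorted([
    mac + ":1458828000",
    mac + ":1458831600",
    mac + ":1458835200",
    "66:77:88:99:aa:bb:1458831600",
])
hits = range_scan(keys, mac + ":1458831600", mac + ":1458838800")
```

Because the keys are sorted, both boundaries are found by binary search and only the rows between them are read; rows for other MAC addresses are never visited.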
This query is much harder:
Give me all the entries for last 'N' hours Where LocationID IN (locID1, locID2, ..);
Cassandra does have secondary indexes so it seems like it would be "easy" but I don't know how much data it would scan through. I haven't looked at Cassandra since it added secondary indexes.
In hbase you'd have to scan the entire table or create a second table. I would recommend creating a second table where the rowkey would be < location:timestamp > and you'd duplicate the data. Then you'd use that table to lookup the data by location using a scan and setting the start and end keys.

Resources