In Cassandra, how to implement a fixed number of rows in one partition?

All
I'm implementing a kind of history table using Cassandra 2.2.
My current schema has a row key for userid and a clustering key for timestamp; each row is then a user behavior record. I want to keep only the 10 latest rows for a given userid. How can I implement this smartly?
Thanks for any suggestion!

Given a Data model of:
CREATE TABLE history (
    userid text,
    activity_time timeuuid,
    behavior text,
    PRIMARY KEY ((userid), activity_time)
);
The best I can think of would be to do the following:
Insert all "history" records with some reasonable TTL.
How long a TTL depends on your particular use case.
When querying by a userid, limit your returned result set to 10:
SELECT * FROM history WHERE userid='fromanator' LIMIT 10;
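For the insert side, a rough sketch (the 30-day TTL and the behavior value are just assumed examples, not recommendations):

INSERT INTO history (userid, activity_time, behavior)
VALUES ('fromanator', now(), 'clicked_ad')
USING TTL 2592000;  -- 30 days in seconds; each record expires on its own

Note that for the LIMIT 10 query above to return the newest rows, the table either needs WITH CLUSTERING ORDER BY (activity_time DESC) or the SELECT needs an explicit ORDER BY activity_time DESC.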
However, with this approach, if a user hasn't had any history within the TTL, you will get no results back. Depending on your use case this may be acceptable.
If you absolutely need to keep at least the last 10 records, then you're going to have a much more complicated data model and application code to achieve this in Cassandra.

This may not be the most elegant solution and won't strictly adhere to only storing 10 records at any given time, but you could store the row data as a list (if there is structure to the row data, you'd have to handle that structuring yourself or use user-defined types). If you already have this list available to you when you write, you'd just truncate it to the latest 10 values before writing; otherwise you could wait until the next time a read is done on that list, truncate it to 10 records, then write that back to Cassandra.
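A minimal sketch of that idea, assuming a hypothetical history_list table with a single list<text> column (names and values are illustrative):

CREATE TABLE history_list (
    userid text PRIMARY KEY,
    behaviors list<text>
);

-- Prepend the newest record; the list grows until the application trims it.
UPDATE history_list SET behaviors = ['clicked_ad'] + behaviors WHERE userid = 'fromanator';

-- On read (or before the next write), fetch the list, keep only the first 10
-- elements client-side, then overwrite the column with the truncated list.
SELECT behaviors FROM history_list WHERE userid = 'fromanator';
UPDATE history_list SET behaviors = ['newest', 'older', 'oldest-kept'] WHERE userid = 'fromanator';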
If you're not so much concerned with how much data is stored, but rather are only interested in retrieving the last 10 results, then fromanator's solution (with or without a TTL depending on whether you care more about the size of the data or ensuring 10 results) is the best.

Related

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
    account_id text,
    value text,
    PRIMARY KEY (account_id)
);
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
On the application side, we query with a SELECT by account_id statement.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
    account_id text,
    time timeuuid,
    value text,
    PRIMARY KEY (account_id, time)
) WITH CLUSTERING ORDER BY (time DESC);
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
The question is HOW MUCH better, if at all, the second data model is than the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time, all I need to do is a simple
SELECT * FROM current_data_2 WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL the versions of the row that were written for the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all, I encourage you to examine the official documentation about the read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the Merge Cells by Timestamp step in the documentation (again, check the flow diagram). Notice that, in your first case, each SSTable will contain only one row per key.
In both of your cases the main driving factor is how many SSTables you have to check during a read. It's somewhat independent of how many records each SSTable contains.
But in the second case you have much bigger SSTables, which leads to longer SSTable compactions. Also, TTL expiration adds its own overhead, since expired cells linger as tombstones until compaction removes them. So the first case is somewhat preferable.
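To make the contrast concrete, a hedged sketch of what a single event write looks like under each model (the values and the 5-minute TTL are purely illustrative):

-- First model: every event is an upsert of the same single row per account.
INSERT INTO current_data (account_id, value) VALUES ('acct-1', '42.7');

-- Second model: every event adds a new clustered row that expires on its own.
INSERT INTO current_data_2 (account_id, time, value)
VALUES ('acct-1', now(), '42.7') USING TTL 300;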

Cassandra data modeling - Do I choose hotspots to make the query easier?

Is it ever okay to build a data model that makes the fetch query easier even though it will likely create hotspots within the cluster?
While reading, please keep in mind that I am not working with Solr right now, and given the frequency with which this data will be accessed, I didn't think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, with a query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses the user_id and created_date (day the transaction was created, always set to midnight) as the primary key:
CREATE TABLE transactions_by_user_and_day (
    user_id int,
    created_date timestamp,
    created_date_time timestamp,
    transaction_id uuid,
    PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using the created_date as part of the PK allows users to be spread around the cluster more evenly to prevent hotspots. However, from an access perspective it makes the data access layer do a bit more work than we would like. It ends up having to create an IN statement with all days in the provided range instead of giving a date and a greater-than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE TABLE transactions_by_user_and_day (
    user_id int,
    created_date_time timestamp,
    transaction_id uuid,
    PRIMARY KEY ((user_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_ids for the user and filter on a specific date range within Cassandra. However, this creates a chance of hotspots within the cluster. Users with longevity and/or high volume will create quite a few more columns in the row. We intend on supplying a TTL on the data so anything older than 60 days drops off.

Additionally, I've analyzed the size of the data: 60 days' worth of data for our most high-volume user is under 2 MB. Doing the math, if we assume that all 40,000 users (this number won't grow significantly) are spread evenly over a 3-node cluster at 2 MB of data per user, you end up with a max of just over 26 GB per node ((13333.33*2)/1024). In reality, you aren't going to end up with 1/3 of your users doing that much volume, and you'd have to get really unlucky to have Cassandra, using vnodes, put all of those users on a single node. From a resources perspective, I don't think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Data Model 1: Something else you could do would be to change your data access layer to do a query for each partition individually, instead of using the IN clause. Check out this page to understand why that would be better.
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
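For example, instead of the multi-partition IN, the data access layer could issue one single-partition query per day in the range (ideally in parallel); the values below are placeholders:

SELECT * FROM transactions_by_user_and_day
WHERE user_id = 1234 AND created_date = '2016-03-24' AND created_date_time > '2016-03-24 08:00:00';

-- The lower time bound only matters for the earliest day in the range.
SELECT * FROM transactions_by_user_and_day
WHERE user_id = 1234 AND created_date = '2016-03-25';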
Data model 2: 26GB of data per node doesn't seem like much, but a 2MB fetch seems a bit large. Of course if this is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2MB, that should be fine.
One other solution would be to use Data Model 2 with bucketing. This would give you more overhead on writes though, as you'd have to maintain a bucket lookup table as well. Let me know if you need me to elaborate more on this approach.
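As one possible reading of that suggestion (the bucket column, how it is derived, and both table names are assumptions, since the answer doesn't spell them out):

CREATE TABLE transactions_by_user_bucket (
    user_id int,
    bucket int,                  -- e.g. 0-9, derived from a hash of the transaction_id
    created_date_time timestamp,
    transaction_id uuid,
    PRIMARY KEY ((user_id, bucket), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);

-- Lookup table maintained on write, recording which buckets are in use per user.
CREATE TABLE user_buckets (
    user_id int,
    bucket int,
    PRIMARY KEY ((user_id), bucket)
);

Reads would first fetch the user's buckets, then query each (user_id, bucket) partition and merge the results client-side.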

select older versions of data after update in Cassandra

This is my use-case.
I have inserted a row of data in Cassandra with the following query:
INSERT INTO TableWide1 (UID, TimeStampCol, Value, DateCol) VALUES ('id1','2016-03-24 17:54:36',45,'2015-03-24 00:00:00');
I update one row to have a new value.
update TableWide1 set Value = 46 where uid = 'id1' and datecol='2015-03-24 00:00:00' and timestampcol='2016-03-24 17:54:36';
Now, I would like to see all versions of this data from Cassandra. I know in HBase, this is pretty straightforward, but in Cassandra, is this even possible?
I explored a bit using writetime(), but it just gives the latest time of the newly updated data. And it cannot be used in a WHERE clause either.
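For context, a query along these lines against the schema below (using the values from the INSERT and UPDATE above) only ever surfaces the most recent write:

SELECT Value, writetime(Value)
FROM TableWide1
WHERE UID = 'id1' AND DateCol = '2015-03-24 00:00:00' AND TimeStampCol = '2016-03-24 17:54:36';
-- Returns 46 and the timestamp of the UPDATE; the earlier value 45 is not visible.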
This is what my schema looks like:
CREATE TABLE TableWide1 (
    UID varchar,
    TimeStampCol timestamp,
    Value double,
    DateCol timestamp,
    PRIMARY KEY ((UID, DateCol), TimeStampCol)
);
So is this technically possible, given the fact the old data still exists in Cassandra?
If your partitions won't get too wide, you could exclude the time partitioning:
CREATE TABLE table_wide (
    UID varchar,
    TimeStampCol timestamp,
    Value double,
    PRIMARY KEY ((UID), TimeStampCol)
);
That's generally bad though, since eventually you will hit the limits of a partition.
But really you had it right. You won't be able to do it in a single statement, but under the covers you can't stream the entire set over anyway; it has to page through it. So you can just iterate through the results of each day, one at a time. If your dataset has days with no data and you don't want to waste reads, you can keep an additional table around to mark which days have data:
CREATE TABLE table_wide_partition_list (
    UID varchar,
    DateCol timestamp,
    PRIMARY KEY ((UID), DateCol)
);
And make one query to it first.
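As a sketch of that flow (the dates are placeholders): first ask the partition-list table which days exist, then read each day's partition on its own:

-- Which day partitions hold data for this user?
SELECT DateCol FROM table_wide_partition_list WHERE UID = 'id1';

-- Then, for each returned day, read that single partition:
SELECT * FROM TableWide1 WHERE UID = 'id1' AND DateCol = '2015-03-24 00:00:00';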
Really, if you want HBase-like behavior for scans, you are probably looking for more of an OLAP-style workload than normal C* usage. For that, it's almost universally recommended to use Spark with Cassandra currently.
Cassandra does not retain old data when updated.
The old value is superseded by the new write and gets discarded when compaction happens.
HBase was not originally made for handling real-time applications and hot data from/for application servers, though things have improved since its early days.
People often use HBase mainly because they already have a Hadoop cluster.
Another noticeable and important difference is that Cassandra is very fast at retrieving single or multiple records by key, but not at range queries (e.g. > 10 AND < 20), because data is distributed based on the hashed partition key. HBase, on the other hand, stores data in sorted order and is an ideal candidate for range queries.
Anyway, since Cassandra doesn't retain old data, you cannot retrieve it.

Defining a partition key in Cassandra

I'm playing around with Cassandra for the first time and I feel like I understand the basics and limits. I'm working with the following model, as an example, for storing tweets collected by hashtag.
create table posts
(
    id text,
    status text,
    service text,
    hashtag text,
    username text,
    caption text,
    image text,
    link text,
    repost boolean,
    created timestamp,
    primary key (hashtag, created)
);
This works very well for the type of query I need:
select * from posts where hashtag = 'demo' order by created desc;
However, if I understand things correctly, there is an upper limit to the number of posts I could store under the single 'demo' partition key, and more importantly, the entire set of posts matching the 'demo' partition key would have to be stored on each replica. I should probably use a more random or variable partition key (maybe the id of the post) if I understand correctly, but I don't know what to use that won't alter the requirements for the query.
If I use id as the partition key (e.g. PRIMARY KEY (id, created)) and add a secondary index on the hashtag column, I get the following error when I run my query:
ORDER BY with 2ndary indexes is not supported.
I get that to use ORDER BY, the partition key must be featured in the where clause, hence my original thought to use hashtag.
Am I overthinking things or is there a better candidate for the partition key?
The direction you go would depend on what volume of writes you expect and how big your cluster is.
If you have a small user community and a small cluster, then you might be overthinking things. A partition can theoretically hold up to 2 billion rows. That's a big number, and would anyone actually want to view more than a few thousand of the most recent tweets for a hashtag? So you'd probably have some kind of cleanup mechanism such as using TTL to delete tweets after some amount of time, which will free up space in the partition, keeping you well below the 2 billion row limit.
If you don't want to clean up old tweets, but want to preserve them for many years, then you might want to use a compound partition key like this:
primary key ((hashtag, year), created)
This would partition the tweets by the tag and the year, so you could store up to 2 billion tweets per tag per year.
The nice thing about partitioning by hashtag is that Cassandra can keep the tweets for a tag sorted by the creation timestamp, making it easy to retrieve the most recent ones with a single query as you've shown.
But if your user community is big, then the issue that is of a bigger concern is avoiding hot spots. If you use just hashtag and a time bin like year for a partition key, then all reads and writes will be to the small number of replicas for that hashtag. If a hashtag is very active on a given day, then you've got all your reads and writes going to just a node or two depending on what replication factor you are using.
If you want to spread out the read and write load, you need to increase the cardinality of a hashtag so that it will map to multiple nodes. Using id as the partition key would achieve this, but that would be going too far since then every tweet would be in a separate partition and you'd get no sorting or easy way to retrieve the most recent tweets for a hashtag.
So a better approach is to create separate bins or buckets, like this:
primary key ((hashtag, bin), created)
The number of bins you create depends on your write load. Let's say you decide that ten nodes can handle the write load for a hot hashtag, then bin would be a value from 0 to 9.
There are a number of ways to set the bin number. You could do a modulo of id by 10, or pick a random number between 0 and 9, or generate a hash value from some combination of fields and take modulo 10 of the results. Whatever method you choose, make sure the numbers from 0 to 9 are equally likely so that your data is spread equally across the bin partitions.
With multiple bins, it is not as easy to retrieve the x most recent tweets for a hashtag since you need to query all the bins and merge the results. You can asynchronously issue a query for each bin of a hashtag in parallel and then merge the results on the client side. Or you can do a single query using the IN clause like this:
select * from posts where hashtag = 'demo' and bin IN (0,1,2,3,4,5,6,7,8,9) AND created > ...
But Cassandra won't sort the results of the single query, so you'd have to do a sort on the client side, which is slower than doing a merge of separate ordered queries.
Now in many cases there will be hashtags that have very little volume, so you might not want to bother using ten bins for them unless they get hot. If so you can make it dynamic in your application, typically using just bin 0, but then increasing the number of bins when a tag is found to be popular. You could use a static column in bin 0 to keep track of the number of active bins for a hashtag.
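A sketch of how the binned table and that static column might look (posts_binned and active_bins are hypothetical names, and most columns are omitted for brevity):

create table posts_binned
(
    hashtag text,
    bin int,
    created timestamp,
    id text,
    status text,
    active_bins int static,   -- shared across a (hashtag, bin) partition; read from bin 0 by convention
    primary key ((hashtag, bin), created)
) with clustering order by (created desc);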
You should avoid using secondary indexes. They are very inefficient in Cassandra.

Cassandra or Hbase?

I have a requirement, where I want to store the following:
Mac Address // PKEY
TimeStamp // PKEY
LocationID
ownerName
Signal Strength
The insertion logic is as follows:
Store the above statistics for each active device (MacAddress) once every hour at each location (LocationID)
The entries are created at end of each hour, so the primary key will always be MAC+TimeStamp
There are no updates, only insertions
The queries which can be performed are as follows:
Give me all the entries for last 'N' hours Where MacAddress = "...."
Give me all the entries for last 'N' hours Where LocationID IN (locID1, locID2, ..);
Needless to say, there are billions of entries, and I want to use either HBase or Cassandra. I've tried to explore, and it seems that Cassandra may not be the correct choice.
The reasons for that is if I have the following in cassandra:
<< RowKey: MacAddress:TimeStamp >>
+ LocationID
+ OwnerName
+ Signal Strength
Both of the queries will scan the whole database, right? Even if I add an index on LocationID, that is only going to help the second query to some extent, because there is no index on the timestamp (I believe that searching on the timestamp is not fast, as the MacAddress:TimeStamp composite key would not allow us to search only on the timestamp, and instead a full scan would happen - is that correct?).
I'm stuck here big time, and any insights would really help on whether we should opt for HBase or Cassandra.
The right way to model this with Cassandra is to use a table partitioned by mac address, ordered by timestamp, and indexed on location id. See the Cassandra data model documentation, especially the section on clustering [predefined sorting]. None of your queries will require a full table scan.
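A rough CQL sketch of that layout (column names and types are assumptions, and the index mirrors this answer's suggestion rather than a general recommendation):

CREATE TABLE device_stats (
    mac_address text,
    time_stamp timestamp,
    location_id text,
    owner_name text,
    signal_strength double,
    PRIMARY KEY ((mac_address), time_stamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC);

CREATE INDEX ON device_stats (location_id);

-- Last N hours for one device: a single-partition slice, no full scan.
SELECT * FROM device_stats
WHERE mac_address = 'aa:bb:cc:dd:ee:ff' AND time_stamp > '2016-03-24 00:00:00';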
You have to remember that NoSQL systems like Cassandra allow horizontal scaling and make it a lot easier to shard the data. By developing a shard strategy (identifying the shard key, etc.) you could dramatically reduce the size of the data on a single instance and make queries (even over massive data sets) doable.
Either one would work for this query:
Give me all the entries for last 'N' hours Where MacAddress = "...."
In Cassandra you would want to use an ordered partitioner so you can do easy scans. That way you would not have to scan the entire table. (I'm a little rusty on Cassandra.)
In HBase it is always ordered by the rowkey so the scan becomes easy. You would just set a start and stop rowkey. Conceptually it would be:
scan.setStartRow(mac+":"+timestamp);
scan.setStopRow(mac+":"+endtimestamp);
And then it would only scan over the rows for the given mac address for the given time period--only a small subset of the data.
This query is much harder:
Give me all the entries for last 'N' hours Where LocationID IN (locID1, locID2, ..);
Cassandra does have secondary indexes so it seems like it would be "easy" but I don't know how much data it would scan through. I haven't looked at Cassandra since it added secondary indexes.
In HBase you'd have to scan the entire table or create a second table. I would recommend creating a second table where the rowkey would be < location:timestamp > and you'd duplicate the data. Then you'd use that table to look up the data by location, using a scan with start and stop keys.
