I'm playing around with Cassandra for the first time and I feel like I understand the basics and limits. I'm working with the following model, as an example, for storing tweets collected by hashtag.
create table posts
(
id text,
status text,
service text,
hashtag text,
username text,
caption text,
image text,
link text,
repost boolean,
created timestamp,
primary key (hashtag, created)
);
This works very well for the type of query I need:
select * from posts where hashtag = 'demo' order by created desc;
However, if I understand things correctly, there is an upper limit to the number of posts I could store under the single 'demo' partition key, and, more importantly, the entire set of posts matching the 'demo' partition key would have to be stored on each replica. I should probably use a more random or variable partition key (maybe the id of the post), but I don't know what to use that won't alter the requirements of the query.
If I use id as the partition key (e.g. PRIMARY KEY (id, created)) and add a secondary index on the hashtag column, I get the following error when I run my query:
ORDER BY with 2ndary indexes is not supported.
I get that to use ORDER BY, the partition key must be featured in the where clause, hence my original thought to use hashtag.
Am I overthinking things or is there a better candidate for the partition key?
The direction you go would depend on what volume of writes you expect and how big your cluster is.
If you have a small user community and a small cluster, then you might be overthinking things. A partition can theoretically hold up to 2 billion rows. That's a big number, and would anyone actually want to view more than a few thousand of the most recent tweets for a hashtag? So you'd probably have some kind of cleanup mechanism such as using TTL to delete tweets after some amount of time, which will free up space in the partition, keeping you well below the 2 billion row limit.
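As a hedged sketch of the TTL approach (using the DataStax Python driver; the 30-day window, contact point, and keyspace name are illustrative assumptions), per-row expiry is just a USING TTL clause on the insert:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

# Assumed contact point and keyspace; adjust for your cluster.
session = Cluster(["127.0.0.1"]).connect("tweet_collector")

# 2592000 seconds = 30 days: each tweet silently expires a month after it is
# written, so the 'demo' partition never grows without bound.
session.execute(
    "INSERT INTO posts (hashtag, created, id, username, status) "
    "VALUES (%s, %s, %s, %s, %s) USING TTL 2592000",
    ("demo", datetime.now(timezone.utc), "tweet-1", "someuser", "hello world"))

You can also set default_time_to_live on the table itself so you don't have to remember the clause on every write.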
If you don't want to clean up old tweets but want to preserve them for many years, then you might want to use a compound partition key like this:
primary key ((hashtag, year), created)
This would partition the tweets by the tag and the year, so you could store up to 2 billion tweets per tag per year.
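For concreteness, a sketch of what that could look like (the table name, the extra year column, and the DESC clustering order are my assumptions; the other columns come from the original table):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("tweet_collector")  # assumed keyspace

# 'year' is an extra int column the application derives from 'created' at
# write time, e.g. 2016 for a tweet created in 2016.
session.execute("""
    CREATE TABLE IF NOT EXISTS posts_by_tag_and_year (
        hashtag  text,
        year     int,
        id       text,
        status   text,
        service  text,
        username text,
        caption  text,
        image    text,
        link     text,
        repost   boolean,
        created  timestamp,
        PRIMARY KEY ((hashtag, year), created)
    ) WITH CLUSTERING ORDER BY (created DESC)
""")

# The original query now needs the year as well; the DESC clustering order
# means no ORDER BY is required to get the newest tweets first.
rows = session.execute(
    "SELECT * FROM posts_by_tag_and_year WHERE hashtag = %s AND year = %s",
    ("demo", 2016))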
The nice thing about partitioning by hashtag is that Cassandra can keep the tweets for a tag sorted by the creation timestamp, making it easy to retrieve the most recent ones with a single query as you've shown.
But if your user community is big, then the bigger concern is avoiding hot spots. If you use just hashtag and a time bin like year for the partition key, then all reads and writes go to the small set of replicas for that hashtag. If a hashtag is very active on a given day, then all of your reads and writes are hitting just a node or two, depending on what replication factor you are using.
If you want to spread out the read and write load, you need to increase the cardinality of a hashtag so that it will map to multiple nodes. Using id as the partition key would achieve this, but that would be going too far since then every tweet would be in a separate partition and you'd get no sorting or easy way to retrieve the most recent tweets for a hashtag.
So a better approach is to create separate bins or buckets, like this:
primary key ((hashtag, bin), created)
The number of bins you create depends on your write load. Let's say you decide that ten nodes can handle the write load for a hot hashtag; then bin would be a value from 0 to 9.
There are a number of ways to set the bin number. You could do a modulo of id by 10, or pick a random number between 0 and 9, or generate a hash value from some combination of fields and take modulo 10 of the results. Whatever method you choose, make sure the numbers from 0 to 9 are equally likely so that your data is spread equally across the bin partitions.
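As a sketch of the hash approach (assuming ten bins and that the tweet id is a string, as in the original table):

import hashlib

NUM_BINS = 10  # sized to the write load you expect for a hot hashtag

def bin_for(post_id: str) -> int:
    """Deterministically map a tweet id to a bin in [0, NUM_BINS)."""
    # Use a stable hash (not Python's built-in hash(), which is salted per
    # process) so every writer computes the same bin for the same id.
    digest = hashlib.md5(post_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BINS

If the ids are numeric, id % 10 is simpler; a random bin also works as long as it is uniform, you just lose the ability to recompute a tweet's bin later from its id.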
With multiple bins, it is not as easy to retrieve the x most recent tweets for a hashtag since you need to query all the bins and merge the results. You can asynchronously issue a query for each bin of a hashtag in parallel and then merge the results on the client side. Or you can do a single query using the IN clause like this:
select * from posts where hashtag = 'demo' and bin IN (0,1,2,3,4,5,6,7,8,9) AND created > ...
But Cassandra won't sort the results of the single query, so you'd have to do a sort on the client side, which is slower than doing a merge of separate ordered queries.
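A sketch of the per-bin fan-out with the Python driver (hashtag/bin/created column names as above; the contact point and keyspace are assumptions; heapq.merge does the k-way merge of the already-sorted per-bin result sets):

import heapq
from cassandra.cluster import Cluster

NUM_BINS = 10
session = Cluster(["127.0.0.1"]).connect("tweet_collector")  # assumed keyspace

# Each bin's rows come back already sorted by 'created DESC'.
select_bin = session.prepare(
    "SELECT * FROM posts WHERE hashtag = ? AND bin = ? "
    "ORDER BY created DESC LIMIT ?")

def recent_posts(hashtag, limit=100):
    # Fire one query per bin without waiting, then merge the sorted streams
    # on the client instead of re-sorting everything.
    futures = [session.execute_async(select_bin, (hashtag, b, limit))
               for b in range(NUM_BINS)]
    per_bin = [list(f.result()) for f in futures]
    merged = heapq.merge(*per_bin, key=lambda r: r.created, reverse=True)
    return list(merged)[:limit]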
Now in many cases there will be hashtags that have very little volume, so you might not want to bother using ten bins for them unless they get hot. If so, you can make it dynamic in your application, typically using just bin 0, and then increasing the number of bins when a tag is found to be popular. You could use a static column in bin 0 to keep track of the number of active bins for a hashtag, as sketched below.
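A hedged sketch of that bookkeeping: a static column is stored once per partition, so keeping it in bin 0 gives you a single place to read the current bin count for a tag before fanning out. Table and column names beyond those already discussed are my own.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("tweet_collector")  # assumed keyspace

# 'active_bins' is declared static: one shared value per (hashtag, bin)
# partition. By convention we only read and write it in bin 0.
session.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        hashtag     text,
        bin         int,
        created     timestamp,
        id          text,
        status      text,
        active_bins int static,
        PRIMARY KEY ((hashtag, bin), created)
    ) WITH CLUSTERING ORDER BY (created DESC)
""")

def active_bins(hashtag):
    # How many bins are writers currently spreading this tag across?
    row = session.execute(
        "SELECT active_bins FROM posts WHERE hashtag = %s AND bin = 0 LIMIT 1",
        (hashtag,)).one()
    return row.active_bins if row and row.active_bins else 1

def mark_hot(hashtag, bins=10):
    # Static columns can be updated with only the partition key in the WHERE.
    session.execute(
        "UPDATE posts SET active_bins = %s WHERE hashtag = %s AND bin = 0",
        (bins, hashtag))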
You should avoid using secondary indexes. They are very inefficient in Cassandra.
I have a high-write table I'm moving from Oracle to Cassandra. In Oracle the PK is a composite of (clientId: int, id: UUID). There are about 10 billion rows. Right off the bat I run into this nonsensical warning:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html :
"If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index."
Not only does this seem to defeat efficient find-by-PK, it fails to define what it means to "query between the fields" and what the difference is between a built-in index, a secondary index, and the primary key + clustering clauses in a CREATE TABLE statement. A junk description. This is 2019. Shouldn't this be fixed by now?
AFAIK it's misleading anyway:
CREATE TABLE dev.record (
clientid int,
id uuid,
version int,
payload text,
PRIMARY KEY (clientid, id, version)
) WITH CLUSTERING ORDER BY (id ASC, version DESC)
insert into record (id,version,clientid,payload) values
(d5ca94dd-1001-4c51-9854-554256a5b9f9,3,1001,'');
insert into record (id,version,clientid,payload) values
(d5ca94dd-1002-4c51-9854-554256a5b9e5,0,1002,'');
The token on clientid indeed shows they're in different partitions as expected.
Turning to the big point: if one were looking for a single row given the clientId and UUID ---AND--- Cassandra allowed you to skip specifying the clientId, so it wouldn't know which node(s) to search, then sure, that find could be slow. But it doesn't:
select * from record where id=d5ca94dd-1002-4c51-9854-554256a5b9e5;
InvalidRequest: "... despite the performance unpredictability, use ALLOW FILTERING"
And ditto with other variations that exclude clientid. So shouldn't we conclude that Cassandra handles high-cardinality table searches that return "very few results" just fine?
Anything that requires reading the entire contents of the database won't work, which is the case with scanning on id, since any of your clientid partitions may contain a matching row. Walking through potentially thousands of sstables per host, and through each partition of each of those, just to check is not going to work. If you are having a hard time with the data model and not totally getting the difference between partition keys and clustering keys, I would recommend walking through some introductory classes (e.g. DataStax Academy), YouTube videos, or a book before designing your schema. This is not a relational database, and designing around your data instead of your queries will get you into trouble. When moving from Oracle you should not just copy your tables over and move the data, or it will not work well.
The clustering key determines the order in which the data within a partition is stored on disk, which is what the documentation is referring to as the "built-in index". Each sstable has an index component that contains the partition key locations for that sstable. It also includes an index of the clustering keys for each partition, sampled every 64kb (by default at least), that can be searched on. The clustering keys that exist between each of these indexed points are unknown, so they all have to be checked. A long time ago a bloom filter of clustering keys was kept as well, but the cases where it helped were so rare compared to its overhead that it was removed in 2.0.
Secondary indexes are difficult to scale well, which is where the warning about cardinality comes from. I would strongly recommend just denormalizing the data and not using an index in any form, as large scatter-gather queries across a distributed system are going to have availability and performance issues. If you really need one, check out http://www.doanduyhai.com/blog/?p=13191 to try to get the data model right (not worth it in my opinion).
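For the record table above, "denormalizing" would mean maintaining a second, query-specific table yourself instead of an index. A hedged sketch (the record_by_id table name and the logged batch are my own choices, not a prescribed pattern):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(["127.0.0.1"]).connect("dev")  # assumed contact point

# Hypothetical companion table: same data, partitioned by id so that
# "find the row(s) for this id" hits exactly one partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS record_by_id (
        id       uuid,
        clientid int,
        version  int,
        payload  text,
        PRIMARY KEY (id, version)
    ) WITH CLUSTERING ORDER BY (version DESC)
""")

insert_main = session.prepare(
    "INSERT INTO record (clientid, id, version, payload) VALUES (?, ?, ?, ?)")
insert_by_id = session.prepare(
    "INSERT INTO record_by_id (id, clientid, version, payload) VALUES (?, ?, ?, ?)")

def save(clientid, rec_id, version, payload):
    # Write both copies; a logged batch keeps them eventually consistent
    # at the cost of some extra write latency.
    batch = BatchStatement()
    batch.add(insert_main, (clientid, rec_id, version, payload))
    batch.add(insert_by_id, (rec_id, clientid, version, payload))
    session.execute(batch)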
Is it ever okay to build a data model that makes the fetch query easier even though it will likely create hotspots within the cluster?
While reading, please keep in mind I am not working with Solr right now and given the frequency this data will be accessed I didn’t think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, with a query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses the user_id and created_date (day the transaction was created, always set to midnight) as the primary key:
CREATE transactions_by_user_and_day (
user_ id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using the created_date as part of the PK allows users to be spread around the cluster more evenly to prevent hotspots. However, from an access perspective it makes the data access layer do a bit more work than we would like. It ends up having to create an IN statement with all days in the provided range instead of giving a date and a greater-than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_ids for the user and filter on a specific date range within Cassandra. However, this causes a chance of hotspots within the cluster. Users with longevity and/or high volume will create quite a few more columns in the row. We intend on supplying a TTL on the data so anything older than 60 days drops off. Additionally, I’ve analyzed the size of the data and 60 days’ worth of data for our most high volume user is under 2 MB. Doing the math, if we assume that all 40,000 users (this number won't grow significantly) are spread evenly over a 3 node cluster and 2 MB of data per user, you end up with a max of just over 26 GB per node ((13,333.33 * 2) / 1024). In reality, you aren’t going to end up with 1/3 of your users doing that much volume, and you’d have to get really unlucky to have Cassandra, using vnodes, put all of those users on a single node. From a resources perspective, I don’t think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Data Model 1: Something else you could do would be to change your data access layer to do a separate query for each day individually, instead of using the IN clause. Check out this page to understand why that would be better:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
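In code, that per-day fan-out might look something like this with the Python driver (the keyspace name and connection details are assumptions; the table is Data Model 1 from the question):

from datetime import timedelta
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("payments")  # assumed keyspace

query = session.prepare(
    "SELECT * FROM transactions_by_user_and_day "
    "WHERE user_id = ? AND created_date = ? AND created_date_time > ?")

def transactions(user_id, start, end):
    # One partition per (user_id, day): issue the per-day queries in parallel
    # instead of a single multi-partition IN query.
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    futures = []
    while day <= end:
        futures.append(session.execute_async(query, (user_id, day, start)))
        day += timedelta(days=1)
    rows = []
    for f in futures:
        rows.extend(f.result())
    return rows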
Data model 2: 26GB of data per node doesn't seem like much, but a 2MB fetch seems a bit large. Of course if this is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2MB, that should be fine.
One other solution would be to use Data Model 2 with bucketing. This would give you more overhead on writes, though, as you'd have to maintain a bucket lookup table as well. Let me know if you need me to elaborate more on this approach.
All
I'm implementing a kind of history table using Cassandra 2.2.
My current schema has a row key for userid, and a cluster key for timestamp, then in each row is a user behavior record. I want to keep only the 10 latest rows for a given userid. How can I implement this smartly?
Thanks for any suggestion!
Given a Data model of:
CREATE TABLE history (
userid text,
activity_time timeuuid,
behavior text,
PRIMARY KEY ((userid), activity_time)
);
The best I can think of would be to do the following:
Insert all "history" records with some reasonable TTL.
How long of a TTL depends on your particular use case
When querying by a userid, limit your returned result set to 10
SELECT * FROM history WHERE userid='fromanator' LIMIT 10;
However, with this approach, if a user hasn't had any history within the TTL then you will get no results back. Depending on your use case this may be acceptable.
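A hedged sketch of that approach (the 90-day default TTL and the DESC clustering order are assumptions; the DESC order is what makes LIMIT 10 return the newest rows rather than the oldest):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # assumed keyspace

# default_time_to_live applies to every insert that doesn't set its own TTL;
# 7776000 seconds = 90 days. Clustering DESC puts the newest rows first.
session.execute("""
    CREATE TABLE IF NOT EXISTS history (
        userid        text,
        activity_time timeuuid,
        behavior      text,
        PRIMARY KEY ((userid), activity_time)
    ) WITH CLUSTERING ORDER BY (activity_time DESC)
      AND default_time_to_live = 7776000
""")

def latest_behaviors(userid, n=10):
    return list(session.execute(
        "SELECT * FROM history WHERE userid = %s LIMIT %s", (userid, n)))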
If you absolutely need to keep at least the last 10 records, then you're going to have a much more complicated data model and application code to achieve this in Cassandra.
This may not be the most elegant solution and won't strictly adhere to only storing 10 records at any given time, but you could store the row data as a list (if there is structure to the row data, you'd have to handle this structuring yourself or use user defined types). If you already have this list available to you when you write to it, you'd just truncate it to the latest 10 values before writing, otherwise you could wait until the next time a read is done on that list, truncate it to 10 records, then write that back to Cassandra.
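A hedged sketch of that read-truncate-write cycle, assuming the behavior records are kept in a CQL list<text> column keyed by userid (table and column names are mine):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS history_list (
        userid    text PRIMARY KEY,
        behaviors list<text>
    )
""")

def append_behavior(userid, behavior, keep=10):
    row = session.execute(
        "SELECT behaviors FROM history_list WHERE userid = %s", (userid,)).one()
    behaviors = (row.behaviors or []) if row else []
    behaviors = ([behavior] + behaviors)[:keep]   # newest first, truncate to 10
    # Overwrite the whole list; note this is not atomic if two clients race
    # on the same user.
    session.execute(
        "UPDATE history_list SET behaviors = %s WHERE userid = %s",
        (behaviors, userid))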
If you're not so much concerned with how much data is stored, but rather are only interested in retrieving the last 10 results, then fromanator's solution (with or without a TTL depending on whether you care more about the size of the data or ensuring 10 results) is the best.
I'm designing a Cassandra schema for a browser event collection system, and I was hoping to sanity check my approach. The system collects user events in the browser, like mouse movements, clicks, etc. The events are stored and processed to create heat maps of user activity on a web page. I've chosen Cassandra for persistence, since my use case is more write heavy than read heavy: every 50 milliseconds, an ajax call dumps the aggregated events to my server, and into the database. I'm using node.js for the server, and the JSON events look something like this on the server:
{ uuid: dsf86ag487hadf97hadf97, type: 'MOVE', time: 12335234345, pageX: 334, pageY:566, .... }
As you can see each user has a unique uuid, associated with each of their events, generated on the browser, stored in a cookie. My read case will be some map-reduce job. Each top-level domain will be a keyspace, and I was planning using the uuid as my partition key. The main table will be the events table, where each row will be one event, using a composite primary key, consisting of the browser-generated uuid and a cassandra-generated timeuuid. The primary key must have a timeuuid component, since two events may have the same timestamp on certain browsers. The data types for event will be strings, ints, timestamps. The total data for a partition should not exceed a few hundred megabytes. So...Is this sane? What questions should I be asking myself? I recognize that this use case has many analogs in sensor data collection, etc, so please point me to existing examples. Thanks in advance.
Choosing a partition key
While recording the user ID may be important in some cases for distinguishing events from different users that may occur at the same time, the user ID is probably not the best choice for the partition key. That is, unless you are planning to analyze the behavior of specific users.
You are probably more concerned with how the heatmap changes over time and specifically which areas of the page were involved. These are probably better considerations for your partition key, though perhaps not stored as a timestamp nor as X/Y coordinates, which I'll get into later.
You will generally want to choose a partition key that has (1) a large distribution of values, to create even load across your cluster, and (2) is made up of values that are relatively "well known". By "well known", I mean something you either know in advance or something that can be computed easily and deterministically. For instance, you will have many users and will gather statistics over many days. While the specific days (encoded as, say, YYYY-MM-DD strings) can easily be determined from a known start/end date range or query input, the set of all valid user IDs (assuming UUIDs or other non-incremental values, or hashes) is much harder to determine without doing a scan of the entire cluster. Avoid doing partition key scans; aim for "exact" random access to your partitions.
Format of the partition key
The partition key is traditionally shown as a single column in many examples, but you can have a multi-column partition key. This can be useful when using date/time information as all or part of the key. You would aim to have as few unique values per column as possible, so that the set of values you need to enumerate is as small as possible, but as many values (or additional columns) as necessary to balance the I/O load and data distribution across the cluster.
For example, if you were to use a timestamp as your partition key, in 64-bit Java timestamp format (milliseconds), there are 1,000 possible partitions per second. Even though you can technically iterate over them, that may be more granular than you need or want. At the other extreme, if your partition key were simply the 4-digit year, then all of that year's events would go to the same partition (making it very large) and to the same set of replica nodes (hotspots, inefficient cluster use). By choosing a key that balances between these extremes, you can control the size of your partitions and also the number of partitions you must access in order to satisfy a query.
Also consider what you'll do when you ever want to delete old data. The easiest means (within a single column family/table) is to delete an entire partition as this helps avoid accumulating individual column tombstones. If you ever want to run an operation like "delete all data older than 2013" then you definitely don't want to bury the date deep down in the data and would rather have it as part of your partition key.
Choosing a row (clustering) key
Any additional columns in the primary key that are not part of the partition key become the row key within the partition, and the rows are clustered (ordered) by the sort order of the first of these columns.
That clustering/sorting is important, because it's generally the only native sorting you're going to get with Cassandra. Even if the partition key is down to the level of a specific hour or minute of a specific day, you might choose to cluster the rows by your millisecond timestamp or time UUID, to keep everything within that partition in chronological order.
You can still have additional columns, like your X/Y coordinates or user IDs, in your row keys -- in case it sounded like I was recommending that you put time (only) in both the partition and clustering keys.
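To make that concrete, a hedged sketch of one possible events table along those lines (the table name, the day-sized bucket, and the per-domain keyspace name are illustrative, not a prescription):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("example_com")  # one keyspace per domain, per the plan

# Partition by (page, day): partitions stay bounded, a whole day can be
# dropped in one partition delete, and the set of partitions to read is easy
# to enumerate. Cluster by a timeuuid so same-millisecond events don't collide.
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        page       text,
        day        text,       -- 'YYYY-MM-DD'
        event_time timeuuid,
        user_id    text,
        event_type text,
        page_x     int,
        page_y     int,
        PRIMARY KEY ((page, day), event_time)
    )
""")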
Using X/Y coordinates
This part has nothing to do with Cassandra, but if you're heat-mapping the page, do be aware that people use different screens and devices at different resolutions. Unless you're doing pixel-perfect layout on your site (and hopefully you're using a fluid, responsive layout instead) then the X/Y coordinate of one user isn't going to match the X/Y coordinates from another user. They might not even match for the same user, if that user switches devices.
Consider mapping not by X/Y coordinate of the mouse, but perhaps the IDs of elements in the DOM. Have an ID for your "sidebar", "main menu", "main body div" and any specific elements you want to map. These would be string keys, not coordinate pairs, and while they'd still be triggered on mouse enter/leave/click the logged information doesn't depend or assume any particular screen geometry.
Perhaps you decide to include the element ID as part of the row or partition key, too.
I'm having a bit of an issue with my application functionality integrating with Cassandra. I'm trying to create a content feed for my users. Users can create posts which, in turn, have the field user_id. I'm using Redis for the entire social graph and using Cassandra columns solely for objects. In Redis, user 1 has a set named user:1:followers with all of his/her follower ids. These follower ids correspond with the Cassandra ids in the users table and user_ids in the posts table.
My goal was originally to simply plug all of the user_ids from this Redis set into a query that would use FROM posts WHERE user_id IN (user_ids here) and grab all of the posts from the secondary index user_id. The issue is that Cassandra purposely does not support the IN operator in secondary indexes because that index would force Cassandra to search ALL of its nodes for that value. I'm left with only two options I can see: Either create a Redis list of user:1:follow_feed for the post IDs then search Cassandra's primary index for those posts in a single query, or keep it the way I have it now and run an individual query for every user_id in the user:1:follower set.
I'm really leaning against the first option because I already have tons and tons of graph data in Redis, and this option would add a new list for every user. The second way is far worse. I would put a massive read load on Cassandra and it would take a long time to run individual queries for a set of ids. I'm kind of stuck between a rock and a hard place, as far as I see it. Is there any way to query the secondary indexes with multiple values? If not, is there a more efficient way to load these content feeds (RAM and speed wise) compared to the options of more Redis lists or multiple Cassandra queries? Thanks in advance.
Without knowing the schema of the posts table (and preferably the others, as well), it's really hard to make any useful suggestions.
It's unclear to me why you need to have user_id be a secondary index, as opposed to your primary key.
In general it's quite useful to key content like posts off of the user that created it, since it allows you to do things like retrieve all posts (optionally over a given range, assuming they are chronologically sorted) very efficiently.
With Cassandra, if you find that a table can effectively answer some of the queries that you want to perform but not others, you are usually best off denormalizing, and creating another table with a different structure in order to keep each of your queries to a single CQL partition and node.
CREATE TABLE posts (
user_id int,
post_id int,
post_text text,
PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC)
This table can answer queries such as:
select * from posts where user_id = 1234;
select * from posts where user_id = 1 and post_id = 53;
select * from posts where user_id = 1 and post_id > 5321 and post_id < 5400;
The reverse clustering on post_id is to make retrieving the most recent posts the most efficient by placing them at the beginning of the partition physically within the sstable.
In that example, user_id being a partition column means that all CQL rows with this user_id will be hashed to the same partition, and hence the same physical nodes, and eventually the same sstables. That's why it's possible to:
retrieve all posts with that user_id, as they are stored contiguously
retrieve a slice of them by doing a ranged query on post_id
retrieve a single post by supplying both the partition column(user_id) and the clustering column (post_id)
In effect, this becomes a hashmap-of-a-hashmap lookup. The one major caveat, though, is that when using partition and clustering columns, you always need to supply the columns from left to right in your query, without skipping any. So in this case, that means you can't retrieve an individual post without knowing the user_id that the post_id belongs to. That is addressable in user code (by storing a reverse mapping and doing the lookup when necessary, or by encoding the user_id into the post_id that is passed around your application), but it is definitely something to take into consideration.
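For the reverse-mapping option, a minimal sketch (the post_owner table name and keyspace are mine; the posts table is the one defined above):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("app")  # assumed keyspace

# Lookup table: which user's partition holds a given post?
session.execute("""
    CREATE TABLE IF NOT EXISTS post_owner (
        post_id int PRIMARY KEY,
        user_id int
    )
""")

def get_post(post_id):
    owner = session.execute(
        "SELECT user_id FROM post_owner WHERE post_id = %s", (post_id,)).one()
    if owner is None:
        return None
    return session.execute(
        "SELECT * FROM posts WHERE user_id = %s AND post_id = %s",
        (owner.user_id, post_id)).one()

Both tables would need to be written together when a post is created (for example in a logged batch) so the lookup never goes stale.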