Cassandra Data Modelling - Identifying Best Row Key - cassandra

I have a platform where various apps put notes. Notes are identified by note_id and apps are identified by app_key, and both note_id and app_key are unique. All my queries are confined to a single app key; I won't need to query for notes across multiple apps.
Now I have to choose a primary key.
If I choose only app_key as partition key and note_id as clustering key, there will be wide rows. That is all the notes of a single app will be grouped against app_key in a single partition.
So:
Find all notes of an app will be efficient (Single partition seek).
Find one note of an app will be efficient.
Delete all notes of an app will be efficient.
Delete one note of an app is efficient.
However there is no guarantee how wide a row will be, i.e. no limit on number of notes a single app can have. The data distribution will be uneven. All notes of an app will be in a single partition, so an app having huge number of notes will create a huge partition resulting in hotspots.
Now let's check option B: the partition key will be both app_key and note_id.
In this case the partition count for an app will depend on the number of notes it has.
Find all notes of an app (Not possible)
Find one note of an app (Efficient assuming seeking to a partition is fast)
Delete all notes of an app (Not possible)
Delete a single note is fast (Assuming the same as above)
So my questions are:
What is the correct balance here?
Am I missing any concepts?
Do the hotspots really matter?
As in the 2nd option an entire query is not possible, are there any alternatives to model this?

My recommendation would be to divide your partition into time-based buckets (e.g. daily/weekly/monthly/yearly), sized according to your throughput, so that you don't suffer from wide partitions.
For example, with daily buckets your partition key will be (app_key, insert_day), where insert_day is a date truncated to midnight, e.g. 8-8-2018-00:00:00:000.
Now when it comes to reading all notes by app_key, you need to iterate from the current day backwards until you find no more data. The same goes for deletes. Choose the bucket size so that it keeps the number of iterations low.
For the note_id (clustering key) you can use the timeuuid type (generated from the insert date). Then, when you need to select by note_id and app_key, you can calculate the required insert_day from the note_id value (i.e. note_id -> insert date -> insert_day).
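To make the note_id -> insert_day step concrete, here is a minimal client-side sketch in Python. It assumes note_id is a standard version-1 (time-based) UUID and that buckets are UTC days; the function names are made up for illustration and are not part of any driver API.

```python
import uuid
from datetime import date, datetime, timedelta, timezone

# 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def insert_day_from_timeuuid(note_id: uuid.UUID) -> date:
    """Recover the UTC day bucket from a version-1 (time-based) UUID."""
    if note_id.version != 1:
        raise ValueError("note_id must be a version-1 (time-based) UUID")
    # UUID.time is a count of 100-ns intervals since the UUID epoch.
    unix_seconds = (note_id.time - UUID_EPOCH_OFFSET) // 10_000_000
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()


def day_buckets_backwards(start: date, max_days: int):
    """Yield day buckets from `start` going back in time, newest first.

    The caller queries one (app_key, insert_day) partition per bucket and
    decides when to stop; `max_days` is only a safety cap (an assumption).
    """
    for offset in range(max_days):
        yield start - timedelta(days=offset)
```

A point read then becomes "compute insert_day from the note_id, query that single partition", while "all notes of an app" walks the buckets from today backwards.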

Related

Is it possible to insert or update without giving all primary key columns in a Cassandra database?

I have an application that uses Cassandra as its database. Each row of the table is filled in at three separate moments (by three inputs). The table has four primary key columns, and not all of them are available at every moment I need to insert or update.
The error, when trying to insert or update, is:
Some partition key parts are missing
Please consider that my application has to make a lot of writes (nearly 300,000) to the database in a short interval of time, so I want to get the maximum write throughput out of the database.
One approach that might solve the issue is to first read from the database and then write, using dummy values for any primary key column that is not available at the moment of inserting or updating. But that adds roughly another 300,000 reads to the database, which would slow down both the database and my application.
So I am looking for another solution.
four primary key columns, and not all of them are available at every moment I need to insert or update.
As you are finding out, that is not possible. For partition keys in particular, they are used (hashed) to determine which node in the cluster is primarily responsible for the data. As it is a fundamental part of the Cassandra write path, it must be complete at write-time and cannot be changed/updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that keys in Cassandra are unique, so any attempt to update them will result in a new row.
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option? Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.
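As a rough illustration of the "hold the data until all keys are known" idea, here is a small Python sketch of the consumer side. It is an in-memory stand-in, not Pulsar or Kafka client code, and it assumes the three inputs share some correlation id that ties them to the same row (that id and all column names are hypothetical).

```python
from typing import Dict, Iterable, Optional


class PendingRows:
    """Collect partial inputs until every primary key column is present."""

    def __init__(self, key_columns: Iterable[str]):
        self._key_columns = tuple(key_columns)
        self._pending: Dict[str, dict] = {}

    def add(self, correlation_id: str, fragment: dict) -> Optional[dict]:
        """Merge one input; return the full row once it is writable, else None."""
        row = self._pending.setdefault(correlation_id, {})
        row.update(fragment)
        if all(row.get(col) is not None for col in self._key_columns):
            # All key columns are known: hand the row over for a single INSERT.
            return self._pending.pop(correlation_id)
        return None
```

Each row is then written to Cassandra exactly once, with its complete key, and no read-before-write is needed.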

nosql separate data by client

I have to develop a project using a NoSQL database, either Couchbase or Cassandra.
I would like to know if it is recommended to partition the data of each customer in a bucket?
In my case, there will never be requests that span different clients.
The data can be completely separated.
For Couchbase, I saw that a memory quota is reserved for each bucket.
Or should the separation be done somewhere else, such as the document level, or a super column for Cassandra?
Thank you
Or should the separation be done somewhere else, such as the document level, or a super column for Cassandra?
Tip #1, when working with Cassandra, completely erase the word "super column" from your vocabulary.
I would like to know if it is recommended to partition the data of each customer in a bucket?
That depends. It sounds like your queries would be mostly based on a customer id, so it makes sense to have it as a part of your partition key. However, if each customer partition has millions of rows and/or columns underneath it, that's going to get very big.
Tip #2, proper Cassandra modeling is done based on what your required queries look like. So without actually seeing the kinds of queries you need to serve, it's going to be difficult to be any more specific than that.
If you have customer data relating to accounts and addresses, etc, then building a customers table with a PRIMARY KEY of only customer_id might make sense. But if you find that you need to query your customers (for example) by email_address, then you'll want to create a customers_by_email table, duplicate your data into that, and create a PRIMARY KEY that supports that.
Additionally, if you find yourself storing data on customer activity, you may want to consider a customer_activity table with a PRIMARY KEY of ((customer_id,month),activity_time). That will use both customer_id and month as a partition key, storing the customer's activity clustered by activity_time. In this case, if we didn't use month as an additional partition key, each customer_id partition would be continually written to, until it became too ungainly to write to or query (unbound row growth).
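The month component has to be computed by the application on both the write and the read path. A minimal Python sketch, assuming month is stored as a "YYYY-MM" text value in UTC (the format and the helper names are my assumptions, not something Cassandra prescribes):

```python
from datetime import datetime, timezone
from typing import List


def month_bucket(activity_time: datetime) -> str:
    """Month bucket for one activity; expects a timezone-aware datetime."""
    return activity_time.astimezone(timezone.utc).strftime("%Y-%m")


def month_buckets_between(start: datetime, end: datetime) -> List[str]:
    """All month buckets a query over [start, end] has to visit."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    year, month = start.year, start.month
    buckets = []
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets
```

A query for a customer's activity over a date range then issues one query per returned bucket, each hitting exactly one (customer_id, month) partition.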
Summary:
If anyone tells you to use a super column in Cassandra, slap them.
You need to know your queries before you design your tables.
Yes, customer_id would be a good way to keep your data separate and ensure that each query is restricted to a single node.
Build your partition keys to account for unbound row growth, to save you from writing too much data into the same partition.

Azure table delete pattern - delete old items

I'm working with Azure Table (storage) in order to store information about websites I'm working with. So, I planned this structure:
Partition Key - domain name
Row key - Webpage address
Valid until (date time) - after this date, the record will be deleted.
Other crucial data here...
These columns will be stored in a table named after the website address (e.g. "cnn.com").
I have two main use cases (high to low):
1. Check if URL "x" is in the table - find by combination of Partition Key and Row Key - very efficient.
2. Delete old data - remove all expired data (according to the "Valid until" column). This operation takes place every midnight and may delete millions of rows - very heavy.
So, our first task (check if URL exists) is implemented efficiently with this data model. The second task is not. I want to avoid batch deletion.
I also worry about creating hot spots, which would hurt performance, because of the Partition Key. I expect that during some hours I will run more queries for a specific domain. This will make that partition a hot spot and hit my performance. To avoid this, I thought of applying a hash function to the URL and using the result as the partition key. Is this a good idea?
I also thought about other implementations, but they seem to have problems of their own:
Storing the rows in a table named with the deletion date (e.g. "cnn.com-1-1-2016"). This gives us great deletion performance, but a bad search experience (the row can exist in more than one table, e.g. "cnn.com-1-1-2016" or "cnn.com-2-1-2016"...).
What is the right solution for my problem?
Have you seen the Azure Table Storage Design Guide? It describes principles and patterns for designing tables solutions at scale. For hot spots take a look at the prepend / append anti-pattern for some extra information. This is where all your operations occur within a single partition which prevents additional resources from being added. For these types of scenarios you will get better scale if you can distribute the operations across partitions instead.
Let's assume you have a page https://www.yahoo.com/news/death-omar-al-shishani-could-mean-war-against-203132664.html?nhp=1. You can use as PK the domain name + "/news/" + the first 2 letters of the page address, which gives https://www.yahoo.com/news/de, and as RK the rest of the full address. This will split your domain partition into roughly 1000 partitions. If that's not enough, use the first 3 letters in the PK.
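Here is a small Python sketch of that PK/RK split. The helper names are hypothetical. One caveat that I believe applies: the Table service does not accept "/", "\", "#" or "?" in PartitionKey and RowKey values, so the raw URL pieces would need to be encoded before being used as keys. The sketch uses URL-safe base64 for that, which is just one possible choice.

```python
import base64
from typing import Tuple


def split_url_keys(url: str, prefix_letters: int = 2) -> Tuple[str, str]:
    """Split a URL into (pk, rk): pk is everything up to the page name
    plus its first `prefix_letters` characters, rk is the remainder."""
    # Ignore the query string and fragment when locating the page name.
    head = url.split("?", 1)[0].split("#", 1)[0]
    last_slash = head.rfind("/")
    cut = min(last_slash + 1 + prefix_letters, len(head))
    return url[:cut], url[cut:]


def encode_key(raw: str) -> str:
    """Encode a key so it contains only characters from the URL-safe base64 alphabet."""
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")


def decode_key(encoded: str) -> str:
    """Inverse of encode_key."""
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")
```

Because pk and rk are encoded separately, all pages sharing the same two-letter prefix still land in the same partition, and the "does URL x exist" lookup stays a single PK + RK point query.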
Remove obsolete data every 15 minutes (create a separate service for it). Your millions will become just tens of thousands. Or keep less data (e.g. 2 weeks instead of a month). And do not forget to optimize deletion (fetch PK and RK only, set ETag to "*", delete as DynamicTableEntity, batch if possible).

Defining a partition key in Cassandra

I'm playing around with Cassandra for the first time and I feel like I understand the basics and limits. I'm working with the following model, as an example, for storing tweets collected by hashtag.
create table posts
(
id text,
status text,
service text,
hashtag text,
username text,
caption text,
image text,
link text,
repost boolean,
created timestamp,
primary key (hashtag, created)
);
This works very well for the type of query I need:
select * from posts where hashtag = 'demo' order by created desc;
However, if I understand things correctly, there is an upper limit to the number of posts I could store using the singular 'demo' partition key and, more importantly, the entire set of posts matching the 'demo' partition key would have to be stored with each replica. I should probably use a more random or variable partition key (maybe the id of the post) if I understand correctly, but I don't know what to use that won't alter the requirements for the query.
If I use id as the partition key (e.g. PRIMARY KEY (id, created)) and add a secondary index on the hashtag column, I get the following error when I run my query:
ORDER BY with 2ndary indexes is not supported.
I get that to use ORDER BY, the partition key must be featured in the where clause, hence my original thought to use hashtag.
Am I overthinking things or is there a better candidate for the partition key?
The direction you go would depend on what volume of writes you expect and how big your cluster is.
If you have a small user community and a small cluster, then you might be overthinking things. A partition can theoretically hold up to 2 billion cells (roughly, rows multiplied by the columns in each row). That's a big number, and would anyone actually want to view more than a few thousand of the most recent tweets for a hashtag? So you'd probably have some kind of cleanup mechanism, such as using TTL to delete tweets after some amount of time, which will free up space in the partition and keep you well below the 2 billion cell limit.
If you don't want to clean up old tweets, but want to preserve them for many years, then you might want to use a compound partition key like this:
primary key ((hashtag, year), created)
This would partition the tweets by the tag and the year, so the 2 billion cell limit would apply per tag per year.
The nice thing about partitioning by hashtag is that Cassandra can keep the tweets for a tag sorted by the creation timestamp, making it easy to retrieve the most recent ones with a single query as you've shown.
But if your user community is big, then the issue that is of a bigger concern is avoiding hot spots. If you use just hashtag and a time bin like year for a partition key, then all reads and writes will be to the small number of replicas for that hashtag. If a hashtag is very active on a given day, then you've got all your reads and writes going to just a node or two depending on what replication factor you are using.
If you want to spread out the read and write load, you need to increase the cardinality of a hashtag so that it will map to multiple nodes. Using id as the partition key would achieve this, but that would be going too far since then every tweet would be in a separate partition and you'd get no sorting or easy way to retrieve the most recent tweets for a hashtag.
So a better approach is to create separate bins or buckets, like this:
primary key ((hashtag, bin), created)
The number of bins you create depends on your write load. Let's say you decide that ten nodes can handle the write load for a hot hashtag, then bin would be a value from 0 to 9.
There are a number of ways to set the bin number. You could do a modulo of id by 10, or pick a random number between 0 and 9, or generate a hash value from some combination of fields and take modulo 10 of the results. Whatever method you choose, make sure the numbers from 0 to 9 are equally likely so that your data is spread equally across the bin partitions.
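As a sketch of the "hash and take modulo 10" option, here is one way to do it in Python. It assumes the post id is a string, as in the table definition above. The function name is made up, and SHA-256 is just a convenient stable hash. Python's built-in hash() is avoided because string hashes are randomized per process, so the same id could map to different bins on different runs.

```python
import hashlib

NUM_BINS = 10  # assumption: ten bins, as in the example above


def bin_for_post(post_id: str, num_bins: int = NUM_BINS) -> int:
    """Map a post id to a bin in [0, num_bins) in a stable, roughly uniform way."""
    digest = hashlib.sha256(post_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to a bin.
    return int.from_bytes(digest[:8], "big") % num_bins
```

The writer computes bin_for_post(id) and inserts into the (hashtag, bin) partition. Since the bin is derived from the id, any client can recompute it later without storing extra state.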
With multiple bins, it is not as easy to retrieve the x most recent tweets for a hashtag since you need to query all the bins and merge the results. You can asynchronously issue a query for each bin of a hashtag in parallel and then merge the results on the client side. Or you can do a single query using the IN clause like this:
select * from posts where hashtag = 'demo' and bin IN (0,1,2,3,4,5,6,7,8,9) AND created > ...
But Cassandra won't sort the results of the single query, so you'd have to do a sort on the client side, which is slower than doing a merge of separate ordered queries.
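The client-side merge of separately ordered per-bin results can be sketched with Python's heapq.merge. This assumes each per-bin query was run with ORDER BY created DESC, so every input list is already sorted newest first. Rows are modelled as plain dicts here; a real driver would return its own row objects.

```python
import heapq
from itertools import islice
from typing import Iterable, List


def merge_recent(per_bin_rows: Iterable[List[dict]], limit: int) -> List[dict]:
    """Merge per-bin result lists (each sorted by 'created' descending)
    and return the `limit` most recent rows overall."""
    merged = heapq.merge(
        *per_bin_rows,
        key=lambda row: row["created"],
        reverse=True,  # inputs are sorted largest-to-smallest
    )
    # heapq.merge is lazy, so only the rows actually needed are consumed.
    return list(islice(merged, limit))
```

This is a k-way merge, which is why it is cheaper than sorting the combined output of the IN query: it never needs to look at more than the head of each bin's list.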
Now in many cases there will be hashtags that have very little volume, so you might not want to bother using ten bins for them unless they get hot. If so you can make it dynamic in your application, typically using just bin 0, but then increasing the number of bins when a tag is found to be popular. You could use a static column in bin 0 to keep track of the number of active bins for a hashtag.
You should avoid using secondary indexes. They are very inefficient in Cassandra.

Cassandra schema advice needed

I'm designing a Cassandra schema for a browser event collection system, and I was hoping to sanity check my approach. The system collects user events in the browser, like mouse movements, clicks, etc. The events are stored and processed to create heat maps of user activity on a web page. I've chosen Cassandra for persistence, since my use case is more write heavy than read heavy: every 50 milliseconds, an ajax call dumps the aggregated events to my server, and into the database. I'm using node.js for the server, and the JSON events look something like this on the server:
{ uuid: 'dsf86ag487hadf97hadf97', type: 'MOVE', time: 12335234345, pageX: 334, pageY: 566, .... }
As you can see, each user has a unique uuid, associated with each of their events, generated on the browser and stored in a cookie. My read case will be some map-reduce job. Each top-level domain will be a keyspace, and I was planning on using the uuid as my partition key. The main table will be the events table, where each row will be one event, using a composite primary key consisting of the browser-generated uuid and a Cassandra-generated timeuuid. The primary key must have a timeuuid component, since two events may have the same timestamp on certain browsers. The data types for an event will be strings, ints, and timestamps. The total data for a partition should not exceed a few hundred megabytes. So... is this sane? What questions should I be asking myself? I recognize that this use case has many analogs in sensor data collection, etc., so please point me to existing examples. Thanks in advance.
Choosing a partition key
While recording the user ID may be important in some cases for distinguishing events from different users that may occur at the same time, the user ID is probably not the best choice for the partition key. That is, unless you are planning to analyze the behavior of specific users.
You are probably more concerned with how the heatmap changes over time and specifically which areas of the page were involved. These are probably better considerations for your partition key, though perhaps not stored as a timestamp nor as X/Y coordinates, which I'll get into later.
You will generally want to choose a partition key that has (1) a large distribution of values, to create even load across your cluster, and (2) is made up of values that are relatively "well known". By "well known", I mean something you either know in advance or something that can be computed easily and deterministically. For instance, you will have many users and will gather statistics over many days. While the specific set of days (encoded as, say, YYYY-MM-DD strings) can be easily determined based on a known start/end date range or query input, the set of all valid user IDs (assuming UUIDs or other non-incremental value, or hash) is much harder to determine without doing a scan of the entire cluster. Avoid doing partition key scans; aim for "exact" random access to your partitions.
Format of the partition key
The partition key is traditionally shown as a single column in many examples, but you can have a multi-column partition key. This can be useful when using date/time information as all or part of the key. You would aim to have as few unique values per column as possible, so that the set of values you need to enumerate is as small as possible, but as many values (or additional columns) as necessary to balance the I/O load and data distribution across the cluster.
For example, if you were to use a timestamp as your partition key, in 64-bit Java timestamp format, there are 1,000 possible partitions per second. Even though you can technically iterate over them, that may be more granular than you need or want. On the other side, if your partition key were simply the 4-digit year, then all of that year's events would go to the same partition (making it very large) and to the same set of replica nodes (hotspots, inefficient cluster use). By choosing a key that balances between these extremes, you can control the size of your partitions and also the number of partitions you must access in order to satisfy a query.
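To get a feel for that trade-off, here is a small Python sketch that truncates millisecond timestamps to a bucket and lists the buckets a time-range query would have to visit. The constants and function names are illustrative assumptions.

```python
from typing import List

HOUR_MS = 3_600_000
DAY_MS = 24 * HOUR_MS


def bucket_start(ts_ms: int, bucket_ms: int) -> int:
    """Truncate a millisecond timestamp to the start of its bucket."""
    return ts_ms - (ts_ms % bucket_ms)


def buckets_for_range(start_ms: int, end_ms: int, bucket_ms: int) -> List[int]:
    """All bucket keys overlapping the inclusive range [start_ms, end_ms]."""
    first = bucket_start(start_ms, bucket_ms)
    return list(range(first, end_ms + 1, bucket_ms))
```

For one day of data, hourly buckets mean 24 partitions to read and daily buckets mean a single, much larger one. Using the raw millisecond timestamp as the key would mean up to 86,400,000 candidate partitions for the same day.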
Also consider what you'll do if you ever want to delete old data. The easiest means (within a single column family/table) is to delete an entire partition, as this helps avoid accumulating individual column tombstones. If you ever want to run an operation like "delete all data older than 2013" then you definitely don't want to bury the date deep down in the data and would rather have it as part of your partition key.
Choosing a row (clustering) key
Any additional columns in the primary key that are not part of the partition key become the row key within the partition, and the rows are clustered (ordered) by the sort order of the first of these columns.
That clustering/sorting is important, because it's generally the only native sorting you're going to get with Cassandra. Even if the partition key is down to the level of a specific hour or minute of a specific day, you might choose to cluster the rows by your millisecond timestamp or time UUID, to keep everything within that partition in chronological order.
You can still have additional columns, like your X/Y coordinates or user IDs, in your row keys -- in case it sounded like I was recommending that you put time (only) in both the partition and clustering keys.
Using X/Y coordinates
This part has nothing to do with Cassandra, but if you're heat-mapping the page, do be aware that people use different screens and devices at different resolutions. Unless you're doing pixel-perfect layout on your site (and hopefully you're using a fluid, responsive layout instead) then the X/Y coordinate of one user isn't going to match the X/Y coordinates from another user. They might not even match for the same user, if that user switches devices.
Consider mapping not by X/Y coordinate of the mouse, but perhaps the IDs of elements in the DOM. Have an ID for your "sidebar", "main menu", "main body div" and any specific elements you want to map. These would be string keys, not coordinate pairs, and while they'd still be triggered on mouse enter/leave/click, the logged information doesn't depend on or assume any particular screen geometry.
Perhaps you decide to include the element ID as part of the row or partition key, too.

Resources