Cassandra - Data Modeling Time Series - Avoiding "Hot Spots"?

I'm working on a Cassandra data model to store records uploaded by users.
The potential problem is that some users may upload 50-100k rows in a 5-minute period, which can result in a "hot spot" for the partition key (user_id). (The DataStax recommendation is to rethink the data model if there are more than 10k rows per partition.)
How can I avoid having too many records on a partition key in a short amount of time?
I've tried using the Time Series suggestions from Datastax, but even if I had year, month, day, hour columns, a hot spot may still occur.
CREATE TABLE uploads (
user_id text
,rec_id timeuuid
,rec_key text
,rec_value text
,PRIMARY KEY (user_id, rec_id)
);
The use cases are:
Get all upload records by user_id
Search for upload records by date range

A few possible ideas:
Use a compound partition key instead of just user_id. The second part of the partition key could be a random number from 1 to n. For example if n were 5, then your uploads would be spread out over five partitions per user instead of just one. The downside is when you do reads, you have to repeat them n times to read all the partitions.
Have a separate table to handle incoming uploads using the rec_id as the partition key. This would spread the load of uploads equally across all the available nodes. Then, to get that data into the table with user_id as the partition key, periodically run a Spark job to extract new uploads and add them to the user_id based table at a rate that the single partitions can handle.
Modify your front end to throttle the rate at which an individual user can upload records. If only a few users are uploading at a high enough rate to cause a problem, it may be easier to limit them rather than modify your whole architecture.
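The first idea (a compound partition key with a random bucket) can be sketched on the client side roughly as follows. This is a minimal illustration in plain Python, not driver code; the function names and the value n = 5 are assumptions made for the example.

```python
import random

NUM_BUCKETS = 5  # the "n" from the idea above; illustrative value


def write_partition_key(user_id, num_buckets=NUM_BUCKETS, rng=random):
    """Pick a random bucket in 1..n so one user's writes spread over n partitions."""
    return (user_id, rng.randint(1, num_buckets))


def read_partition_keys(user_id, num_buckets=NUM_BUCKETS):
    """A read has to cover every bucket, so it fans out to n partition keys."""
    return [(user_id, bucket) for bucket in range(1, num_buckets + 1)]
```

Each write goes to one of the n keys, while a read issues one query per key returned by read_partition_keys and merges the results, which is the read-side cost mentioned above.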

Related

Querying one record from tens of millions of records in Azure Table Storage

I have a typical scenario where a consumer is calling an Azure Function (EP1) (synchronously) which then queries Azure Table storage (having 5 million records), based upon the input parameters of the Azure Function API.
Azure Table Storage has following columns:
Order Number (incremental number)
IsConfirmed (can have value Y or N)
Type of Order (can be of 6 types maximum)
Order Date
Order Details
UUID
Now when consumer queries, it generally searches with the Order Number and expects the Order Date and Order Details in response, along with Order Number.
For this, we had chosen:
Partition Key: IsConfirmed + Type of Order
Row Key: UUID
Now, for a search over 5 million records, the partition key design means the partition being searched often holds more than 3 million records (most orders have IsConfirmed as Y and one specific Type of Order among the six), and the Table query takes more than 5 minutes.
As a result, the consumer generally times out as the wait configured on consumer side is 60 secs.
So looking for recommendation on how to do this efficiently.
Can we choose partition key as Order Number (but that will create 5 million partitions) or a combination of Order Number+IsConfirmed+TypeofOrder?
Ours is a write heavy Java application and READ happens much less.
+++++++++++ UPDATE +++++++++++++++
As suggested by Gaurav in the answer, after making orderid the partition key, the query is working as expected.
Now that brings us to the next problem: we have other API queries where only the order date and type are used as input search criteria.
Since this doesn't match the partition key, this second type of query basically does a whole table scan and the consumer times out again.
So what should the design be to handle these types of queries? The Azure docs suggest creating a separate table where order type + order date becomes the partition key. However, that would mean that whenever we write, we have to write to both tables (one with orderid as the partition key and the other with order date + type as the partition key).
Can we choose partition key as Order Number (but that will create 5 million partitions) or a combination of Order Number+IsConfirmed+TypeofOrder?
You can certainly choose partition key as order number as there is nothing wrong in having large number of partitions. However, please keep in mind that partition key value is of string type. What you may want to do is pad your order number with some character (say 0) so that all of your orders are of the same length.
In this case, I would actually recommend that you keep the row key as empty.
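To see why padding matters, here is a small sketch in plain Python. The width of 10 digits and the function name are assumptions for illustration; the point is only that string keys compare lexicographically, so unpadded numbers do not sort in numeric order.

```python
def order_partition_key(order_number, width=10):
    """Left-pad the order number with zeros so every key has the same length."""
    return str(order_number).zfill(width)


# Without padding, string comparison puts "100" before "9".
unpadded = sorted(str(n) for n in [9, 10, 100])

# With padding, string order matches numeric order.
padded = sorted(order_partition_key(n) for n in [9, 10, 100])
```

Here unpadded comes out as ['10', '100', '9'], while padded comes out as ['0000000009', '0000000010', '0000000100'].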
You may also want to think about storing multiple copies of the same data with different partition key/row key combination depending on your querying requirements. For example, if you were to query by order date, you may want to make another copy of the data with order date as the partition key.
Generally speaking it is recommended that you do point queries (query including both partition key and row key). Next best option would be to query by partition key (you would want to keep data in partition key small so that you're not doing partition scans). All other options would result in full table scan which is not at all recommended.
You may find this link useful: https://learn.microsoft.com/en-us/azure/storage/tables/table-storage-design-guidelines.

Best Cassandra data model for maintaining bounded lists per user

I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website constantly sees bots and heavy users, which is why we want to cap events and consider only "normal" users.
I currently have the following data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100).
This doesn't happen in real time, and until we clean up the heavy partitions we sometimes get bad latencies when reading.
Do you have any suggestions of a better table design for such case?
Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
Default TTL set on the table.
Clustering on a time component in descending order.
Adjust query to use a range on the timestamp, never querying data past the TTL.
TTL:
later on we go and clean up "heavier" partitions
How long (on average) before the cleanup happens? One thing I would do is use a TTL on that table, set to somewhere around the maximum amount of time before your team usually has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp descending) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';
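The lower bound in that query can be computed rather than hard-coded. A minimal sketch in plain Python, assuming (as in the example above) that you stay one day inside the TTL window; the function name is made up for illustration.

```python
from datetime import date, timedelta


def safe_lower_bound(today, ttl_days):
    """Earliest day to query: one day inside the TTL window, to avoid expired rows."""
    return today - timedelta(days=ttl_days - 1)


# Today is the 11th and the TTL is 5 days, so query back 4 days.
bound = safe_lower_bound(date(2020, 3, 11), ttl_days=5)
cql_literal = bound.strftime("%Y-%m-%d 00:00:00")
```

With these inputs cql_literal is '2020-03-07 00:00:00', which matches the timestamp used in the query above.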
The problem with your existing implementation is that deletes create tombstones, which eventually cause read latencies. Creating too many tombstones is not recommended.
FIFO implementation based on count (number of rows per partition) is not possible. The better approach for your use case is not to delete records in the same table. Use Spark to migrate the table into a new temp table and remove the extra records in the migration process. Something like:
1) Create a new table
2) Using Spark, read from the original table, migrate all required records (filtering out the extra records), and write to the new temp table.
3) Truncate the original table. Note that the truncate operation does not create tombstones.
4) Migrate everything from the temp table back to the original table using Spark.
5) Truncate the temp table.
You can do this in a maintenance window of your application (something like once a month); until then you can restrict reads with LIMIT 100 per partition.
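The filtering in step 2 amounts to "keep the newest K rows per partition". As a rough illustration of that logic only, here is a plain-Python stand-in (not Spark code); the row layout (user_id, event_type, timestamp, event_blob) follows the question, and the function name is hypothetical.

```python
from collections import defaultdict


def keep_latest(rows, k=100):
    """Keep at most k newest rows per (user_id, event_type) partition.

    rows: iterable of (user_id, event_type, timestamp, event_blob) tuples.
    """
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[(row[0], row[1])].append(row)

    kept = []
    for partition_rows in by_partition.values():
        # Newest first, mirroring CLUSTERING ORDER BY (timestamp DESC).
        partition_rows.sort(key=lambda r: r[2], reverse=True)
        kept.extend(partition_rows[:k])
    return kept
```

In a real Spark job the same effect would typically come from a per-partition ranking followed by a filter, but that part is left out here.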

Querying split partitions on Cassandra in a single request

I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working on, which involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a max logical capacity of 2B entries per partition, but a suggested max of a few hundred MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition, with the purpose of reducing the size per partition.
The video suggested using either a time-based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (ie. point to the partition to find records), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500 MB partition limit, so I wouldn't be able to partition on a per-date basis. That means my daily data would be broken down into hourly partitions or, alternatively, into "bucketed" partitions (for balanced partition sizes). Either way, all my daily data would be spread across multiple partition splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Basically this comes down to choosing the right resolution for your data. I would say the first step is to determine what fits your data best. For the sake of example, let's take 1 hour as a good resolution, so the question is how to fetch all records for a particular date.
Your application logic will be slightly more complicated, since you are trading simplicity for the ability to store large amounts of data in a distributed fashion. You take the date you need, issue 24 queries in a loop, and glue the data together at the application level. However, the glued result can be huge (I do not know your presentation or export requirements, so this could pull 1M records into memory).
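A minimal sketch of that loop in plain Python, assuming hourly buckets named like "2015-10-29-13" and a caller-supplied fetch_partition function that runs one query per bucket (both are assumptions for illustration, not part of any driver API). Using a generator means rows are handed on as they arrive rather than all being collected in memory first.

```python
from datetime import date, datetime, timedelta


def hourly_buckets(day):
    """Return the 24 hourly partition-key values for one calendar day."""
    start = datetime(day.year, day.month, day.day)
    return [(start + timedelta(hours=h)).strftime("%Y-%m-%d-%H") for h in range(24)]


def fetch_day(day, fetch_partition):
    """Query each hourly partition in turn and yield its rows one by one."""
    for bucket in hourly_buckets(day):
        yield from fetch_partition(bucket)
```

A caller would pass something like a function that executes SELECT ... WHERE bucket = ? for the given bucket value and returns the rows.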
Another idea is to have one table as a simple lookup table, keyed by date, whose values are the partition keys holding financial data for that date. Then, when you read, you go first to the lookup table to get the keys and then to the partitions holding the results. You can also store a counter of values per partition key so you know how much data to expect.
All in all, it is best to figure out some natural bucket in your data set and add it to the date (organization, zip code, etc.), and you can use the trick with the additional lookup table. This approach can be used for the symbol you mentioned. You can have symbols as partition keys, cluster by date, and store as values the partitions holding results for that date. Then you query for symbol # on 29-10-2015 and you see that partitions A, D and Z have results, so you go to those partitions, get the financial data from them, and glue it together at the application level.
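The lookup-table idea can be modelled in a few lines of plain Python, with dictionaries standing in for the two tables. The class and field names are invented for this illustration and are not from the original posts.

```python
class BucketedStore:
    """In-memory stand-in for a data table plus a (symbol, day) lookup table."""

    def __init__(self):
        self.data = {}    # partition key -> list of records
        self.lookup = {}  # (symbol, day) -> set of partition keys

    def write(self, symbol, day, partition, record):
        """Store the record and remember which partition holds data for (symbol, day)."""
        self.data.setdefault(partition, []).append((symbol, day, record))
        self.lookup.setdefault((symbol, day), set()).add(partition)

    def read(self, symbol, day):
        """Consult the lookup first, then visit only the listed partitions."""
        results = []
        for partition in sorted(self.lookup.get((symbol, day), ())):
            results.extend(
                rec for sym, d, rec in self.data[partition]
                if sym == symbol and d == day
            )
        return results
```

The cost is one extra write per record to keep the lookup current, in exchange for reads that touch only partitions known to contain matching data.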

how to avoid sorting on clustering key columns in cassandra

I am a bit new to cassandra.
I have created a table like below
create table events(day text, hour text, sip text, dip text, count counter,
primary key((day, hour), sip, dip));
Our use case is that the application receives many events per second. We would like to have a separate partition per hour of a day, and we need to update the counter if the same event is received again. We would also like to have unique entries for each combination of the dip and sip columns, hence I have included those as part of the primary key.
Since the dip and sip columns form the clustering key, sorting takes place while inserting records into the table. In our case sorting is not required for these columns, and it is an overhead when we insert millions of rows into a table. How can we avoid this sorting overhead? Can anyone help me?
Ordering by clustering columns is needed for Cassandra to function correctly. It needs to store the data that way to keep the row keys unique and to support things like range queries on clustering columns. As Arun says, this allows your subsequent updates to run quickly.
You could reduce the amount of sorting by inserting rows in sorted order, for example by having the first clustering column be a time stamp. But then you'd lose the benefit of being able to increment your counter since you wouldn't know the time stamp key of the earlier event. To get the final counts you'd need to do a roll up operation after each hour to aggregate matching events.
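The hourly roll-up mentioned above is essentially a group-and-count over (sip, dip). A minimal plain-Python sketch, assuming each raw event is a (timestamp, sip, dip) tuple; the function name is made up for illustration.

```python
from collections import Counter


def roll_up(events):
    """Count how many times each (sip, dip) pair occurred in one hour of events.

    events: iterable of (timestamp, sip, dip) tuples.
    """
    return Counter((sip, dip) for _timestamp, sip, dip in events)
```

The resulting counts would then be written to a summary table keyed by (day, hour), sip, dip, replacing the in-place counter increments.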
Another way would be to make sip and/or dip part of your partition key. Each event would then hash to a different partition bucket and no sorting would be required. But then you'd lose the grouping of events into one hour partitions. This could be good or bad depending on your needs. If you have a very high rate of events, grouping them all into the same one hour partition could create hot spots since all events will hash to the same node, so making events separate partitions would spread out the write load. If reading the events later as a one hour chunk is more important to you, then having them grouped into one partition will make reading them more efficient at the cost of more expensive writes due to the sorting.
So in general, if you keep your partitions to a reasonable size, the sorting overhead should not be too large since it is done in memory. If your partitions are so large that they are causing performance problems, decrease their size by adding another field to the partition key to break the partitions into smaller chunks to spread out the load on more nodes.

Windows Azure table access latency Partition keys and row keys selection

We've got a windows azure table storage system going on where we have various entity types that report values during the day so we've got the following partition and row key scenario:
There are about 4000-5000 entities. There are 6 entity types and the types are roughly evenly distributed, so around 800 each.
PartitionKey: entityType-Date
Row key: entityId
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
We are having a problem in that if we want to query a month of data for one entity we find that we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger i.e. perhaps a whole week of data and expand the rowKeys to entityId and date.
What other options are open to me, or is it simply the case that Windows Azure tables suffer fairly high latency?
Some options include
Make the 31 queries in parallel
Make a single query on a partition key range, that is
Partition key >= entityType-StartDate and Partition key <= entityType-EndDate and Row key = entityId.
It is possible that depending on your data, this query may have less latency than your current query.
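Both options can be sketched in plain Python. The filter string uses the Table service's query operators (ge, le, eq); fetch_one is a hypothetical caller-supplied function that runs one partition query, and the date format is assumed to be yyyy-mm-dd so that string comparison on the partition key matches date order (the range query only behaves as intended under that assumption).

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def partition_keys(entity_type, start, end):
    """One 'entityType-Date' partition key per day in the inclusive range."""
    days = (end - start).days + 1
    return [f"{entity_type}-{(start + timedelta(days=i)).isoformat()}" for i in range(days)]


def range_filter(entity_type, start, end, entity_id):
    """Single-query alternative: a partition key range plus an exact row key."""
    return (
        f"PartitionKey ge '{entity_type}-{start.isoformat()}' and "
        f"PartitionKey le '{entity_type}-{end.isoformat()}' and "
        f"RowKey eq '{entity_id}'"
    )


def fetch_parallel(keys, fetch_one, max_workers=8):
    """Run one query per partition key concurrently; results keep the key order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))
```

Which option is faster depends on the data, so it is worth measuring both against your own tables.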

Resources