Querying split partitions on Cassandra in a single request

I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working on, which involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a hard limit of 2 billion cells per partition, but a suggested maximum of a few hundred MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store as time series, meaning I can easily accumulate GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition, with the purpose of keeping each partition under the recommended size.
It suggested using either a time-based key or an arbitrary "bucket" key that gets incremented once a manageable number of rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (i.e. they point to the partition that holds the records), how do I find all the records that end up spread across multiple partitions without having to specify every bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely exceed the 100-500 MB partition guideline, so I wouldn't be able to partition on a per-date basis. My daily data would instead be broken down into hourly partitions, or alternatively into "bucketed" partitions (for balanced partition sizes). Either way, all my daily data would be spread across multiple partition splits.
Given this scenario, how do I go about querying all records for a given day? (Additional clustering keys could include a symbol I want results for, or I may want all the records for that specific day.)
Any help would be greatly appreciated.
Thank you.

Basically this comes down to choosing the right resolution for your data. I would say the first step for you is to determine what fits your data best. For the sake of example, let's take one hour as a good fit; the question is then how to fetch all records for a particular date.
Your application logic will be slightly more complicated, since you are trading simplicity for the ability to store large amounts of data in a distributed fashion. You take the date you need, issue 24 queries in a loop, and glue the data together at the application level. Be aware that the glued result can be huge (I do not know your presentation or export requirements, so this could pull 1M records into memory).
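As a rough illustration, here is a minimal sketch of that pattern using the DataStax Python driver; the ticks_by_day table, its columns, and the keyspace name are all hypothetical:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('finance')

# Assumed schema: PRIMARY KEY ((day, hour), ts, symbol),
# i.e. each (day, hour) pair is its own partition.
select_hour = session.prepare(
    "SELECT ts, symbol, price FROM ticks_by_day WHERE day = ? AND hour = ?")

def fetch_day(day):
    # Issue 24 queries, one per hour bucket, and glue the results together.
    rows = []
    for hour in range(24):
        rows.extend(session.execute(select_hour, (day, hour)))
    return rows  # careful: this can pull a whole day's records into memory

ticks = fetch_day('2015-10-29')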
Another idea is to have a simple lookup table, keyed by date, whose values are the partition keys that hold financial data for that date. When you read, you first go to the lookup table to get the keys, and then to the partitions holding the results. You can also store a counter of values per partition key, so you know how much data to expect.
All in all, it is best to figure out some natural bucket in your data set (organization, zip code, etc.), add it to the date, and combine that with the lookup-table trick. This approach can be used for the symbol you mentioned: use symbols as partition keys, cluster by date, and store the partitions that have results for that date as values. Then you query for symbol # on 29-10-2015, see that partitions A, D and Z have results, go to those partitions, fetch the financial data, and glue it together at the application level.
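A minimal sketch of that lookup-table idea, applied to the symbol example; all table and column names here are hypothetical:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('finance')

# Assumed lookup schema:
#   CREATE TABLE lookup_by_symbol (
#       symbol text,
#       day text,
#       buckets set<int>,   -- which bucket partitions hold data for this day
#       PRIMARY KEY (symbol, day)
#   );

def fetch_symbol_day(symbol, day):
    # Step 1: consult the lookup table to learn which partitions to hit.
    row = session.execute(
        "SELECT buckets FROM lookup_by_symbol WHERE symbol = %s AND day = %s",
        (symbol, day)).one()
    if row is None:
        return []
    # Step 2: read only the partitions that actually contain data,
    # then glue the results together at the application level.
    results = []
    for bucket in sorted(row.buckets):
        results.extend(session.execute(
            "SELECT * FROM ticks_by_bucket "
            "WHERE day = %s AND bucket = %s AND symbol = %s",
            (day, bucket, symbol)))
    return results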

Related

What is the cardinality of a partition key?

If I use a randomly generated unique id, is it correct that the cardinality would be rather large?
If I have a key with low cardinality, like 5 category values that the partition key can take, and I want to distribute it, the recommended approach seems to be to make the partition key a composite key.
But this requires that I specify all parts of the composite key in my query to retrieve all records for that key.
Even then, the generated token might end up on the same node.
Is there any way to choose the additional column for the composite key that would guarantee the data is distributed?
The thing is that with Cassandra you actually want the partitioning keys to be "known", so that you can access the data when you need it. I'm not sure what you mean by large cardinality on the partitioning key; you would simply get a lot of partitions in the cluster, which is usually fine.
If you want to distribute the data around the cluster, you can use artificial columns; this approach is sometimes called bucketing. Basically, if you want to keep 100k+ (or, in newer versions, 1 million+) columns, it's fine to split the data into multiple partitions.
Some people simply use a trick: when they insert the data, they add an artificial bucket column to the partition key, say random(1-10). When they read the data out, they issue 10 queries (or use an IN operator), fetch the data, and merge it on the client side. This approach has the added benefit of preventing "hot rows" in the cluster.
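For concreteness, a small sketch of that trick with the Python driver; the data_by_key table and its columns are hypothetical:

import random
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo')
NUM_BUCKETS = 10

# Assumed schema: PRIMARY KEY ((key, bucket), ...) so each logical key
# is spread over NUM_BUCKETS physical partitions.

def insert_row(key, value):
    session.execute(
        "INSERT INTO data_by_key (key, bucket, value) VALUES (%s, %s, %s)",
        (key, random.randint(1, NUM_BUCKETS), value))

def read_all(key):
    # Either loop over the buckets (shown here) or issue a single query
    # with IN on the bucket column, then merge on the client side.
    rows = []
    for b in range(1, NUM_BUCKETS + 1):
        rows.extend(session.execute(
            "SELECT value FROM data_by_key WHERE key = %s AND bucket = %s",
            (key, b)))
    return rows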
The chance that any given key ends up on the same node as another is more or less 1/NUM_NODES, so most of the time this is not something you should worry about too much, unless the number of partitions is smaller than the number of nodes in the cluster.
Basically there are two choices for the additional column: a random one (already described) or some function of the input data. For example, with time series data you might decide to bucket by month; you can always calculate the month from the record you are about to insert and put it in that bucket. When retrieving the data, you know you are looking for something in May 2016, so you know how to select the appropriate bucket.
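A tiny sketch of the deterministic variant, assuming you bucket a time series by month; the naming is illustrative:

from datetime import datetime

def month_bucket(ts):
    # The same input always maps to the same bucket, so on reads you can
    # compute the bucket directly from the date range you are interested in.
    return ts.strftime('%Y-%m')

month_bucket(datetime(2016, 5, 17))   # -> '2016-05', the bucket for May 2016
# Write:  INSERT ... (series_id, bucket, ts, value) with bucket = month_bucket(ts)
# Read:   SELECT ... WHERE series_id = ? AND bucket = '2016-05'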

Cassandra - Data Modeling Time Series - Avoiding "Hot Spots"?

I'm working on a Cassandra data model to store records uploaded by users.
The potential problem is that some users may upload 50-100k rows in a 5-minute period, which can result in a "hot spot" for the partition key (user_id). (The DataStax recommendation is to rethink the data model if there are more than 10k rows per partition.)
How can I avoid having too many records on a partition key in a short amount of time?
I've tried using the time series suggestions from DataStax, but even with year, month, day, and hour columns, a hot spot may still occur.
CREATE TABLE uploads (
    user_id text,
    rec_id timeuuid,
    rec_key text,
    rec_value text,
    PRIMARY KEY (user_id, rec_id)
);
The use cases are:
Get all upload records by user_id
Search for upload records by date range
A few possible ideas:
Use a compound partition key instead of just user_id. The second part of the partition key could be a random number from 1 to n. For example, if n were 5, then each user's uploads would be spread over five partitions instead of just one. The downside is that on reads you have to repeat the query n times to read all the partitions (see the sketch after this list).
Have a separate table to handle incoming uploads, using rec_id as the partition key. This would spread the load of uploads equally across all the available nodes. Then, to get that data into the table with user_id as the partition key, periodically run a Spark job to extract new uploads and add them to the user_id-based table at a rate that the single partitions can handle.
Modify your front end to throttle the rate at which an individual user can upload records. If only a few users are uploading at a high enough rate to cause a problem, it may be easier to limit them rather than modify your whole architecture.
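Here is a minimal sketch of the first idea applied to the uploads table above, with a bucket column added to the partition key; the keyspace name and the choice of n = 5 are assumptions:

# Assumed variant of the schema:
#   CREATE TABLE uploads (
#       user_id text,
#       bucket int,
#       rec_id timeuuid,
#       rec_key text,
#       rec_value text,
#       PRIMARY KEY ((user_id, bucket), rec_id)
#   );
import random
import uuid
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo')
N = 5  # number of buckets per user

def insert_upload(user_id, rec_key, rec_value):
    # uuid.uuid1() generates a version-1 (time-based) UUID, valid for timeuuid.
    session.execute(
        "INSERT INTO uploads (user_id, bucket, rec_id, rec_key, rec_value) "
        "VALUES (%s, %s, %s, %s, %s)",
        (user_id, random.randint(1, N), uuid.uuid1(), rec_key, rec_value))

def read_uploads(user_id):
    # The downside: a read must fan out over all N partitions for the user.
    rows = []
    for b in range(1, N + 1):
        rows.extend(session.execute(
            "SELECT rec_id, rec_key, rec_value FROM uploads "
            "WHERE user_id = %s AND bucket = %s", (user_id, b)))
    return rows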

How to avoid sorting on clustering key columns in Cassandra

I am a bit new to Cassandra.
I have created a table like below:
CREATE TABLE events (
    day text,
    hour text,
    sip text,
    dip text,
    count counter,
    PRIMARY KEY ((day, hour), sip, dip)
);
Our use case is that the application receives many events per second. We would like to have a separate partition per hour of a day, and we need to update the counter if the same event is received again. We would also like to have unique entries for the combination of the sip and dip columns, hence I have included those as part of the primary key.
Here, as the sip and dip columns form the clustering key, sorting takes place while inserting records into the table. In our case sorting is not required for these columns, and it is an overhead when a table holds millions of rows. How can we avoid this sorting overhead? Can anyone help me?
Ordering by clustering columns is needed for Cassandra to function correctly. It needs to store the data that way to keep the row keys unique and to support things like range queries on clustering columns. As Arun says, this allows your subsequent updates to run quickly.
You could reduce the amount of sorting by inserting rows in sorted order, for example by having the first clustering column be a time stamp. But then you'd lose the benefit of being able to increment your counter since you wouldn't know the time stamp key of the earlier event. To get the final counts you'd need to do a roll up operation after each hour to aggregate matching events.
Another way would be to make sip and/or dip part of your partition key. Each event would then hash to a different partition bucket and no sorting would be required. But then you'd lose the grouping of events into one hour partitions. This could be good or bad depending on your needs. If you have a very high rate of events, grouping them all into the same one hour partition could create hot spots since all events will hash to the same node, so making events separate partitions would spread out the write load. If reading the events later as a one hour chunk is more important to you, then having them grouped into one partition will make reading them more efficient at the cost of more expensive writes due to the sorting.
So in general, if you keep your partitions to a reasonable size, the sorting overhead should not be too large since it is done in memory. If your partitions are so large that they are causing performance problems, decrease their size by adding another field to the partition key to break the partitions into smaller chunks to spread out the load on more nodes.
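To make the alternative described above concrete, here is a sketch that moves sip and dip into the partition key; the keyspace and table names are made up:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo')

session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_src (
        day text,
        hour text,
        sip text,
        dip text,
        count counter,
        PRIMARY KEY ((day, hour, sip, dip))
    )
""")

# Each distinct (day, hour, sip, dip) combination is now its own partition,
# so no clustering sort happens; counters still support the upsert pattern:
session.execute(
    "UPDATE events_by_src SET count = count + 1 "
    "WHERE day = %s AND hour = %s AND sip = %s AND dip = %s",
    ('2015-10-29', '13', '10.0.0.1', '10.0.0.2'))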

Cassandra Performance : Less rows with more columns vs more rows with less columns

We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. Given the internal storage structure, we can have wide rows. We almost always need to access data by date, and we often need to access data within a date range, as we have financial data. If we use date as the partition key to support filtering by date, we end up having fewer rows with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key, given that we process millions of transactions every day?
Do we need to change the access pattern to have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra; there are, however, a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case
The whole wide row is stored on the same node: it needs to fit on disk. Also, if some dates are accessed more frequently than other dates (e.g. today), you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance, however; Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe the concepts are still valid.
For a good table design decision, one needs to know all the typical filter conditions. If you have any other fields you typically filter on as an exact match, you could add them to the partition key as well (see the sketch below).
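A small sketch of that advice, assuming a hypothetical account_id field that is always filtered with an exact match; adding it to the partition key both shrinks partitions and spreads a single day's data over more nodes:

# Assumed schema:
#   CREATE TABLE txns_by_day (
#       account_id text,
#       day text,
#       ts timestamp,
#       amount decimal,
#       PRIMARY KEY ((account_id, day), ts)
#   );
from datetime import date, timedelta

def days_between(start, end):
    # Enumerate the days of a range so a date-range query becomes a
    # sequence of single-partition queries for one account.
    d = start
    while d <= end:
        yield d.isoformat()
        d += timedelta(days=1)

# for day in days_between(date(2016, 5, 1), date(2016, 5, 31)):
#     session.execute("SELECT ts, amount FROM txns_by_day "
#                     "WHERE account_id = %s AND day = %s", (acct, day))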

Designing timeseries database in Cassandra

I am looking at creating a Cassandra time series database to store millions of series of daily data, potentially up to 100B data points altogether.
I looked at this article:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
This design is very sound. Essentially, I can put the daily timestamps as columns and, if necessary, shard the rows by appending the day to the row key.
Two questions I have:
I am looking at storing up to 20,000 timestamped (daily) columns per row. Is it even necessary to shard rows by e.g. year with this number of columns? Is there any advantage or disadvantage to sharding rows to reduce the column count to 365 per year?
Another idea I have is, rather than sharding rows, to create one column family per year. This way, when accessing data from multiple years, I would have to query multiple column families rather than one and join the results on the client side. Would this approach speed things up or rather slow everything down?
If you are ever going to handle huge quantities of writes, there is one problem with your approach.
Writing always to one key means that all writes for that key go to a single node. Basically you will use one node per day out of your cluster, so you might as well have one huge instance of Cassandra rather than bother setting up a cluster.
If your write frequency gets really high you might bring down the nodes responsible for that day/key.
My advice is to bucket each day into multiple rows that are used simultaneously. Pure time bucketing can be dangerous, as a sudden surge within one bucket could bring everything down.
You could create your bucket (row key) like this:
[ROW_BASE_NAME] + [DAY] + someHashFunction(timestamp) % 10
[ROW_BASE_NAME] + [DAY] + random.nextInt(10)
[ROW_BASE_NAME] + [DAY] + nextbucket <--- that is if you have a secure way to rotate the bucket yourself
There are many ways to do it; you could also use some element of the column being saved.
But I think it is important to do this in order to leverage the whole Cassandra cluster at all times.
Note that my answer is only valid for write-heavy applications, since you will have to use a multi-get (whole-row reads across multiple keys) to read all the data and reconstitute the whole timeline for that day.
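A small sketch of the first two bucket schemes listed above; since the original answer predates CQL, the bucket is modeled here simply as the last component of the row key, and all names are illustrative:

import random
import zlib

ROW_BASE_NAME = 'timeline'
NUM_BUCKETS = 10

def hashed_bucket_key(day, timestamp_ms):
    # Deterministic: someHashFunction(timestamp) % 10.
    bucket = zlib.crc32(str(timestamp_ms).encode()) % NUM_BUCKETS
    return '%s:%s:%d' % (ROW_BASE_NAME, day, bucket)

def random_bucket_key(day):
    # Random spread: random.nextInt(10).
    return '%s:%s:%d' % (ROW_BASE_NAME, day, random.randrange(NUM_BUCKETS))

# To read a whole day back, enumerate all NUM_BUCKETS keys for that day
# (a multi-get) and merge the rows on the client side:
day_keys = ['%s:%s:%d' % (ROW_BASE_NAME, '2011-03-06', b)
            for b in range(NUM_BUCKETS)]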
