Cassandra Time Series Data Modelling and Limiting Partition Size - cassandra

We are currently investigating Cassandra as the database for a large time series system.
I have read through https://academy.datastax.com/resources/getting-started-time-series-data-modeling about modelling time series data in Cassandra.
What we have is high velocity timeseries data coming in for many weather stations. Each weather station has a number of "sensors" that each collect three metrics: temperature, humidity, and light.
We are trying to store each series as a wide row. However, we expect to get billions of readings per station over the life of the project, so we would like to limit the row size.
We would like there to be a single row for each (weather_station_id, year, day_of_year), that is, a new row for every day. However, we still want the partition key to be weather_station_id - that is, we want all readings for a station to be on the same node.
We currently have the following schema, but I would like to get some feedback.
CREATE TABLE weather_station_data (
weather_station_id int,
year int,
day_of_year int,
time timestamp,
sensor_id int,
temperature int,
humidity int,
light int,
PRIMARY KEY ((weather_station_id), year, day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (year DESC, day_of_year DESC, time DESC, sensor_id DESC);
In the aforementioned document, they make use of this "limit partition row by date" concept. However, it is unclear to me whether or not the date in their examples is part of the partition key.

According to the tutorial, if we choose to have weather_station_id as the only partition, the row will be exhausted.
i.e C* has a practical limitation of 2 billion columns per partition.
So IMO, your data-model is bad.
However, it is unclear to me whether or not the date in their examples is part of the partition key.
The tutorial used
PRIMARY KEY ((weatherstation_id,date),event_time)
So, yes they considered data to be part of partition key.
we want all readings for a station to be on the same node.
I am not sure, why you wan't such a requirement. You can always fetch weather data using multiple queries for more than one year.
select * from weather_station_data where weather_station_id=1234 and year= 2013;
select * from weather_station_data where weather_station_id=1234 and year= 2014;
So consider changing your structure to
PRIMARY KEY ((weather_station_id, year), day_of_year, time, sensor_id)
Hope it helps!

In my opinion the datastax model isn't really great. The problem with this model:
They are using the weather station as partition key. All rows with the same partition key are stored on the same machine. This means: If you have 10 years raw data (100ms steps), you will break cassandras limit really fast. 10 years × 365 days × 24 hours × 60 min × 60 seconds x 10 (for 100ms steps) x 7 columns. The limit is 2 Billion. In my opinion you will not use the benefits of cassandra if you build this data model. You can also use, for each weather station, a mongo, mysql or another database.
The better solution: Ask yourself how you will query this data. If you say: I query all data per year, use the year also as partion key. If you need also query data of more than one year, you can create two queries with a different year. This works and the performance is better. (The bottleneck is maybe only the network to your client)
One little more tipp: Cassandra isn't like mysql. It's a denormalized database. This means: It's not dirty to save your data more than one time. This means: It is important for your to query your data per year, it's also important to query your data per hour, per day of year or per sensor_id, you can create column families with different partition key and parimary key order. It's okay to duplicate your data. Cassandra is optimized for write performance, not for read. This means: It's often better to write the data in the right order instead of reading it in the right order. In cassandra 3.0 there is a new feature, called materialized views, for automatic duplicating. And if you think: Ohhh no, i will duplicate the needed storage. Remember: Storage is really cheap. It's okay to buy ten HDDs with 1tb. It cost nothing. The performance is important.
I have one question to your: Can you aggregate your data? Cassandra has a column type called counter. You can create a java/scala application where your aggregate your data while they are produced. You can use a streaming framework for this: Flink or Spark. (If you need a bit more than only counting.). One scenario: You aggregating your data for each hour and day. You got your data in your streaming app. Now: You have an variable for hourly data. You count up or down or whatever. If the hour is finishes, your put this row in your hourly column family and daily column family. In your daily column family your using a counter. I hope, you understand what i mean.

Related

Efficient reading/transforming partitioned data in delta lake

I have my data in a delta lake in ADLS and am reading it through Databricks. The data is partitioned by year and date and z ordered by storeIdNum, where there are about 10 store Id #s, each with a few million rows per date. When I read it, sometimes I am reading one date partition (~20 million rows) and sometimes I am reading in a whole month or year of data to do a batch operation. I have a 2nd much smaller table with around 75,000 rows per date that is also z ordered by storeIdNum and most of my operations involve joining the larger table of data to the smaller table on the storeIdNum (and some various other fields - like a time window, the smaller table is a roll up by hour and the other table has data points every second). When I read the tables in, I join them and do a bunch of operations (group by, window by and partition by with lag/lead/avg/dense_rank functions, etc.).
My question is: should I have the date in all of the joins, group by and partition by statements? Whenever I am reading one date of data, I always have the year and the date in the statement that reads the data as I know I only want to read from a certain partition (or a year of partitions), but is it important to also reference the partition col. in windows and group bus for efficiencies, or is this redundant? After the analysis/transformations, I am not going to overwrite/modify the data I am reading in, but instead write to a new table (likely partitioned on the same columns), in case that is a factor.
For example:
dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, BARCODE, CUSTNUM, .... FROM STORE_DATA_SECONDS WHERE YEAR = 2020 and DATE='2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM, .... FROM STORE_DATA_HRS WHERE YEAR = 2020 and DATE='2020-11-12'")
Now, if I join them, do I want to include YEAR and DATE in the join, or should I just join on STORE_ID_NUM (and then any of the timestamp fields/customer Id number fields I need to join on)? I definitely need STORE_ID_NUM, but I can forego YEAR AND DATE if it is just adding another column and makes it more inefficient because it is more things to join on. I don't know how exactly it works, so I wanted to check as by foregoing the join, maybe I am making it more inefficient as I am not utilizing the partitions when doing the operations? Thank you!
The key with delta is to choose the partitioned columns very well, this could take some trial and error, if you want to optimize the performance of the response, a technique I learned was to choose a filter column with low cardinality (you know if the problem is of time series, it will be the date, on the other hand if it is about a report for all clients in that case it may be convenient to choose your city), remember that if you work with delta each partition represents a level of the file structure where its cardinality will be the number of directories.
In your case I find it good to partition by YEAR, but I would add the MONTH given the number of records that would help somewhat with the dynamic pruning of spark
Another thing you can try is to use BRADCAST JOIN if the table is very small compared to the other.
Broadcast Hash Join en Spark (ES)
Join Strategy Hints for SQL Queries
The latter link explains how dynamic pruning helps in MERGE operations.
How to improve performance of Delta Lake MERGE INTO queries using partition pruning

Purge old data strategy for Cassandra DB

We store events in multiple tables depending on category.
Each event have an id but contains multiple subelements.
We have a lookup table to find events using the subelement_id.
Each subelement can participate at max in 7 events.
Hence the partition will hold max 7 rows.
We will have 30-50 BILLIONS of rows in eventlookup over a period of 5 years.
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
Problem: How do we delete old data once we reach the 5 (or some other number) year mark.
We want to purge the "tail" at some specific intervals, say every week or month.
Approaches investigated so far:
TTL of X years (performs well, but TTL needs to be known before hand, 8 extra bytes for each column)
NO delete - simply ignore the problem (somebody else's problem :0)
Rate limited single row delete (do complete table scan and potentially billions of delete statements)
Split the table to multiple tables -> "CREATE TABLE eventlookupYYYY". Once a year is not needed, simply drop it. (Problem is every read should potentially query all tables)
Is there any other approaches we can consider ?
Is there a design decision we can make now ( we are not in production yet) that will mitigate the future problem?
If it's worth the extra space, track for ranges of recordtimes your subelement_id in a seperate table / columnfamiliy.
Then you can easily get the ids to delete for records having a specific age if you do not want to set a ttl a priori.
But keep in mind to make this tracking distribute well, just a single date will generate hotspots in your cluster and very wide rows, so think about some partition key like (date,chunk) where I uses a random number from 0-10 in the past for chunk. Also you might look at TimeWindowCompactionStrategy - here is a blog post about it: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
Your partition key is only set to subelement_id, so all tuples of 7 events for all recordtimes will be in one partition.
Given your table structure, you need to know all the subelement_id of all your data just to fetch a single row. So, with this assumption, your table structure can be improved a bit by sorting your data by recordtime DESC:
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
eventtype int,
parentid text,
partition bigint,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
WITH CLUSTERING ORDER BY (recordtime DESC);
Now all of your data is in descending order and this will give you a big advantage.
Suppose that you have multiple years of data (eg from 2000 to 2018). Assuming you need to keep only the last 5 years, you'd need to fetch data by something like:
SELECT * FROM eventlookup WHERE subelement_id = 'mysub_id' AND recordtime >= '2013-01-01';
This query is efficient because C* will retrieve your data and will stop scanning the partition exactly where you wanted to: 5 years ago. The big plus is that if you have tombstones after that point, well, they won't impact your reads at all. That means you can "safely" trim after that point safely by issuing a delete with
WHERE subelement_id = 'mysub_id' AND recordtime < '2013-01-01';
Beware that this delete will create tombstones that will be skipped by your reads, BUT they will be read during compactions, so keep it in mind.
Alternatively, you can simply skip the delete part if you don't need to reclaim your storage space, your system will always run smooth because you will always retrieve your data efficiently.

Cassandra time series table design for timestamp range queries

Our problem is a bit different from a usual timeseries problem as we do not have natural partition key in our data. In our system we get not more than 5k/s messages, so following many publications (like this one) we figured out a following schema (it's more complex but the below matters most):
CREATE TABLE IF NOT EXISTS test.messages (
date TEXT,
hour INT,
createdAt TIMESTAMP,
uuid UUID,
data TEXT,
PRIMARY KEY ((date, hour), createdAt, uuid)
)
We mostly want to query the system based on the creation (event) time; other filtering will likely be done on different engines like Spark. The problem is that we may have a query that spans e.g. two months, so ideally we should put 60+ dates and 24 hours in the WHERE-IN-part of query, which is cumbersome to say the least. Of course, we can execute queries like below:
SELECT * FROM messages WHERE createdat >= '2017-03-01 00:00:00' LIMIT 10 ALLOW FILTERING;
My understanding is that, while the above works, it will make a full scan, which will be expensive on larger cluster. Or am I mistaken and C* knows, which partitions to scan?
I was thinking to add an index, but this problem likely falls into high-cardinality antipattern, as I understand.
EDIT: the question is not that much about the data model, though suggestions are welcome, but more about feasibility of making the queries with cratedat range instead or listing all date and hour values required in WHERE-IN-part of query to avoid full scans.

Is Cassandra a good choice for this kind of time series data vs sql server?

Consider this scenario, we collect financial market data (e.g. price of fund) and store it in a sql table.
Normally Fund prices at most once a day, so the table can be:
FundId Date Price1 Price2
When we want some data, a simple query will do:
select Date, Price1, Price2 from FundPriceTable where Date between XX and XX
However, as we gathered more and more data, the above query performance began to go down. We tried re-indexing and refresh stats. The problem is that when we retrieve vast amount of data (e.g. get 10 year history for 1000 fund), it can take quite a while.
I am wondering for this scenario (no join at all), will system like Cassandra show any performance benefit (assume same hardware)?
I tried to find some benchmark articles between Cassandra and sql server for time series, unfortunately didn't find anything.
Depends on your schema. The performance boost depends on your partition key. In your example:
You can split your data by day or month. This example is spitted by month:
fundPricesByDay (month int, timestamp timestamp, productId text, Price1 float, Price2, PRIMARY KEY(month, timestamp, productId))
If you need all data between the first and third month, you can execute 3 queries:
select * from fundPricesByDay where month = 1 AND timestamp > 60000;
select * from fundPricesByDay where month = 2;
select * from fundPricesByDay where month = 3 AND timestamp < 99999;
With these three queries you will get all data between timestamp 60000 and 99999. But you execute all queries on different vNodes. This means each node has to handle fewer rows than sql. It makes a performance boost. Read a bit more about how Cassandra works and you will understand how you can boost your tables.
You ask explicitly for the same hardware. Maybe there's no performance boost. Simply benchmark it. But Cassandra will definitely win in the combination of scalability and performance. SQL has its limit (depends on the hardware, clustering is possible but complicated to implement and has also its limitations), Cassandra does not have this limitations in scalability and performance. (or better: It's really difficult to hit a limit when you have a good schema.)

Querying split partitions on Cassandra in a single request

I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working for, that involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a max logical capacity of 2B entries per partition, but a suggested max of a couple 100s MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition with the purpose or reducing the size per partition requirement.
The video pointed out to using either a time based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (ie. point to the partition to find records), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500Mb partition limit, so I wouldn't be able to set a partition on a per date basis, that means that my daily data would be broken down into hourly partitions, or alternatively, into "bucketed" partitions (for balanced partition sizes). This means that all my daily data would be spread across multiple partitions splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Basically this goes down to choosing right resolution for your data. I would say first step for you would be to determinate what is best fit for your data. Lets for sake of example take 1 hour as something that is good and question is how to fetch all records for particular date.
Your application logic will be slightly more complicated since you are trading simplicity for ability to store large amounts of data in distributed fashion. You take date which you need and issue 24 queries in a loop and glue data on application level. However when you glue that in can be huge (I do not know your presentation or export requirements so this can pull 1M to memory).
Other idea can be having one table as simple lookup table which has key of date and values of partition keys having financial data for that date. Than when you read you go first to lookup table to get keys and then to partitions having results. You can also store counter of values per partition key so you know what amount of data you expect.
All in all it is best to figure out some natural bucket in your data set and add it to date (organization, zip code etc.) and you can use trick with additional lookup table. This approach can be used for symbol you mentioned. You can have symbols as partition keys, clustering per date and values of partitions having results for that date as values. Than you query for symbol # on 29-10-2015 and you see partitions A, D and Z have results so you go to those partitions and get financial data from them and glue it together on application level.

Resources