How to model time series data in Cassandra when data has a non-uniform generation rate? - cassandra

I am planning to migrate data from my existing database (Postgres) to Cassandra. Here is a brief overview of the system:
Current data set size is around 2 billion events.
Each data point represents an event. Properties of this event are: user_id, event_name, timestamp.
This data comes from a finite set of sources. For the sake of simplicity, let's assume 3 different sources S1, S2, S3, all of them pushing to a Kafka topic. The Cassandra microservice consumes data from this topic.
The rate of data coming from S1, S2 and S3 is different. Assume S1 pushes 1 event per user every minute, S2 pushes 1 event per user every 15 minutes, and S3 pushes 1 event per user every hour.
There are two types of queries this system should support
Get latest event for a given user
Get the list of events for a given user and date range (the date range spans at most 30 days)
I am trying to model this data using a few different approaches.
Partition data for a single user into monthly buckets. For this, additional columns timestamp_year and timestamp_month are added to the partition key. timestamp is used as the clustering key.
Pros: Less than 10ms write latency. Max partition size is around 60MB (working well on Cassandra 3.11). Getting the latest event takes less than 10ms (99.999th percentile).
Cons: Getting month-level data is slow because too much data is read from a single partition. If I put a limit on the number of records fetched (let's say 10000), the latency improves. Partition size is non-uniform because of the different data rates from the 3 sources.
I have tried weekly buckets instead of monthly buckets, and pagination, to improve the other parameters. The one thing I have not been able to sort out is that partition size is non-uniform because of the different data rates from the 3 sources.
How can I keep partition size (almost) consistent in such a data model? Ideas are welcome.

This is a classic problem, and there is no easy way to make partition sizes uniform. If you can predict the rate of ingestion per user, you can probably group users into tiers, such as high, medium and low ingestion users.
The time bucket then depends on the tier. For a high ingestion user, a partition covers a day; for a low ingestion user, a partition covers a month.
To speed up a month query on a high ingestion user, you can run 30 parallel queries, one per daily partition, and see whether that improves your query time.
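The tiered bucketing described above could be sketched as follows. This is a minimal illustration, not a tested production model: the tier names, the `BUCKET_BY_TIER` mapping, the "week" granularity for medium users, and the key formats are all assumptions made for the example. It only computes which time buckets a date range touches; each returned bucket would correspond to one partition to query, and those queries could be issued in parallel.

```python
from datetime import date, timedelta
from typing import List

# Hypothetical tiers: bucket granularity is chosen per user from the expected ingestion rate.
BUCKET_BY_TIER = {"high": "day", "medium": "week", "low": "month"}


def bucket_key(tier: str, day: date) -> str:
    """Return the time-bucket part of the partition key for one calendar day."""
    granularity = BUCKET_BY_TIER[tier]
    if granularity == "day":
        return day.isoformat()
    if granularity == "week":
        iso_year, iso_week, _ = day.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    return f"{day.year}-{day.month:02d}"


def buckets_for_range(tier: str, start: date, end: date) -> List[str]:
    """List the distinct buckets, in order, that cover start..end inclusive."""
    keys: List[str] = []
    current = start
    while current <= end:
        key = bucket_key(tier, current)
        if not keys or keys[-1] != key:
            keys.append(key)
        current += timedelta(days=1)
    return keys
```

With this layout a 30-day read for a high ingestion user fans out to 30 small partitions, while the same read for a low ingestion user touches one or two monthly partitions.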

Related

Delta Lake partitioning strategy for event data

I'm trying to build a system that ingests, stores and can query app event data. In the future it will be used for other tasks (ML, analytics, etc.), which is why I think Databricks could be a good option (for now).
The main use case will be retrieving user-action events occurring in the app.
Batches of this event data will land in an S3 bucket about every 5-30 mins and Databricks Auto Loader will pick them up and store it in a Delta Table.
A typical query will be: get all events where colA = x over the last day, week, or month.
I think the typical strategy here is to partition by date, e.g.:
date_trunc("day", date) # 2020-04-11T00:00:00.000+0000
This will create 365 partitions in a year. I expect each partition to hold about 1GB of data. In addition to partitioning, I plan on using z-ordering for one of the high cardinality columns that will frequently be used in the where clause.
Is this too many partitions?
Is there a better way to partition this data?
Since I'm partitioning by day and data is coming in every 5-30 mins, is it possible to just "append" data to a day's partition instead?
It really depends on the amount of data arriving per day and how many files have to be read to answer your query. If it's tens of GB, then a partition per day is fine. But you can also partition by the timestamp truncated to the week, in which case you'll get only 52 partitions per year. Z-ordering will help keep the files optimized, but if you're appending data every 5-30 minutes, you'll end up with at least 48 files per day inside the partition (one per batch, even at the 30-minute cadence), so you will need to run OPTIMIZE with ZORDER every night, or on a similar schedule, to reduce the number of files. Also, make sure that you're using optimized writes: although this makes the write operation slower, it decreases the number of files generated (if you're planning to use Z-ordering, then it makes no sense to enable auto compaction).
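As a rough check on the small-file concern above, the per-day file count can be estimated from the batch cadence. This sketch assumes every batch writes at least one file into the current day's partition; the function name and the `files_per_batch` parameter are hypothetical, and real counts depend on how the writer splits its output.

```python
MINUTES_PER_DAY = 24 * 60


def min_files_per_day(batch_interval_minutes: int, files_per_batch: int = 1) -> int:
    """Lower bound on files landing in one daily partition before compaction."""
    batches_per_day = MINUTES_PER_DAY // batch_interval_minutes
    return batches_per_day * files_per_batch


# The question states batches arrive every 5 to 30 minutes.
fastest_cadence_files = min_files_per_day(5)
slowest_cadence_files = min_files_per_day(30)
```

Under these assumptions a daily partition accumulates somewhere between 48 and 288 files before any compaction runs, which is why a periodic OPTIMIZE is suggested.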

Cassandra aggregation

The Cassandra database is not very good at aggregation, which is why I decided to do the aggregation before writing. I am storing some data (e.g. transactions) for each user, which I aggregate by hour. That means for one user there will be only one row for each hour.
Whenever I receive new data, I read the row for the current hour, aggregate it with the received data and write it back. I use this data to generate hourly reports.
This works fine with low velocity data, but I observed considerably high data loss when velocity is very high (e.g. 100 records for 1 user in a minute). This is because reads and writes are happening very fast and, because of "delayed write", I am not getting updated data.
I think my "aggregate before write" approach itself is wrong. I was thinking about a UDF, but I am not sure how it will impact performance.
What is the best way to store aggregated data in Cassandra ?
My idea would be:
Model data in Cassandra on hour-by-hour buckets.
Store plain data into Cassandra immediately when they arrive.
At hour X, process all the data of hour X-1 and store the aggregated result in another table.
This would allow you to handle very fast incoming rates, process data only once, and store the aggregates in another table for fast reads.
I use Cassandra to pre-aggregate also. I have different tables for hourly, daily, weekly, and monthly. I think you are probably getting data loss as you are selecting the data before your last inserts have replicated to other nodes.
Look into the counter data type to get around this.
You may also be able to specify a higher consistency level in either the inserts or selects to ensure you're getting the most recent data.
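The "store raw, aggregate the previous hour later" idea from the first answer could look roughly like this. It is a sketch under assumptions: raw rows are modelled as in-memory `(user_id, timestamp, amount)` tuples, the aggregate is a plain sum, and the function names are hypothetical. In a real system the input would be read from the raw hourly table and the result written to the aggregate table.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

RawRow = Tuple[str, datetime, float]  # (user_id, event time, amount); assumed shape


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate_previous_hour(raw_rows: Iterable[RawRow], now: datetime) -> Dict[Tuple[str, datetime], float]:
    """Sum amounts per user for the hour that ended before the current hour began."""
    target = hour_bucket(now) - timedelta(hours=1)
    totals: Dict[Tuple[str, datetime], float] = defaultdict(float)
    for user_id, ts, amount in raw_rows:
        if hour_bucket(ts) == target:
            totals[(user_id, target)] += amount
    return dict(totals)
```

Because the raw rows are only appended and each completed hour is aggregated once, there is no read-modify-write cycle on a hot row, which is where the lost updates in the question come from.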

What is the best data model for timeseries in Cassandra when *fast sequential reads* are required

I want to store streaming financial data into Cassandra and read it back fast. I will have up to 20000 instruments ("tickers") each containing up to 3 million 1-minute data points. I have to be able to read large ranges of each of these series as speedily as possible (indeed it is the reason I have moved to a columnar-type database as MongoDB was suffocating on this use case). Sometimes I'll have to read the whole series. Sometimes I'll need less but typically the most recent data first. I also want to keep things really simple.
Is this model, which I picked up in a Datastax tutorial, the most effective? Not everyone seems to agree.
CREATE TABLE minutedata (
    ticker text,
    time timestamp,
    value float,
    PRIMARY KEY (ticker, time)
) WITH CLUSTERING ORDER BY (time DESC);
I like this because there are up to 20,000 tickers, so the partitioning should be efficient, and there are only up to 3 million minutes in a partition, while Cassandra can handle up to 2 billion cells per partition. Also, with the descending time order I get the most recent data when using a limit on the query.
However, the book Cassandra High Availability by Robbie Strickland mentions the above as an anti-pattern (using sensor-data analogy), and I quote the problems he cites from page 144:
Data will be collected for a given sensor indefinitely, and in many
cases at a very high frequency
With sensorID as the partition key, the row will grow by two
columns for every reading (one marker and one reading).
I understand point one would be a problem but it's not in my case due to the 3 million data point limit. But point 2 is interesting. What are these "markers" between each reading? I clearly want to avoid anything that breaks contiguous data storage.
If point 2 is a problem, what is a better way to model timeseries so that they can efficiently be read in large ranges, fast? I'm not particularly keen to break the timeseries into smaller sub-periods.
If your query pattern was to find a few rows for a ticker using a range query, then I would say having all the data for a ticker in one partition would be a good approach since Cassandra is optimized to access partitions efficiently.
But if everything is in one partition, then the query runs on only one node. Since you say you often want to read large ranges of rows, you may want more parallelism.
If you split that same data across many nodes and read it in parallel, you may be able to get better performance. For example, if you partitioned your data by ticker and by year, and you had ten nodes, you could theoretically issue ten async queries and have each year queried in parallel.
Now 3 million rows is a lot, but not really that big, so you'd probably have to run some tests to see which approach is actually faster for your situation.
If you're doing more than just retrieving all these rows and are doing some kind of analytics on them, then parallelism becomes more attractive, and you might want to look into pairing Cassandra with Spark so that the data can be read and processed in parallel on many nodes.
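The ticker-plus-year partitioning with parallel reads could be sketched like this. It is an illustration only: `fetch_partition` is a hypothetical callable standing in for whatever query the application issues per `(ticker, year)` partition, and a thread pool stands in for the driver's asynchronous queries.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from typing import Callable, List, Tuple

Row = Tuple[datetime, float]  # (time, value); assumed row shape


def partitions_for_range(ticker: str, start: datetime, end: datetime) -> List[Tuple[str, int]]:
    """One (ticker, year) partition key per calendar year touched by the range."""
    return [(ticker, year) for year in range(start.year, end.year + 1)]


def read_range(
    ticker: str,
    start: datetime,
    end: datetime,
    fetch_partition: Callable[[str, int], List[Row]],
    max_workers: int = 10,
) -> List[Row]:
    """Fetch every yearly partition concurrently, then merge newest-first."""
    keys = partitions_for_range(ticker, start, end)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = list(pool.map(lambda key: fetch_partition(key[0], key[1]), keys))
    rows = [row for chunk in chunks for row in chunk if start <= row[0] <= end]
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows
```

Whether this beats a single-partition read depends on cluster size and row counts, so it is worth benchmarking both layouts as the answer suggests.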

Design of Partitioning for Azure Table Storage

I have some software which collects data over a large period of time, approx 200 readings per second. It uses an SQL database for this. I am looking to use Azure to move a lot of my old "archived" data to.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding as to how this will scale however; so was hoping somebody might be able to clear this up:
For performance Azure will/might split my table by partitionkey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
Few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or do I need to retrieve the data for all metrics for a particular date/time range? If this is the case, then you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey).
Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also since PartitionKey is a string value, you may want to convert int value to a string value with some "0" prepadding so that all your ids appear in order otherwise you'll get 1, 10, 11, .., 19, 2, ...etc.
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only and not the RowKey. Within a Partition, RowKey serves as unique key. Windows Azure will try and keep data with the same PartitionKey in the same node but since each node is a physical device (and thus has size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
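The two key suggestions above (reverse-tick RowKeys and zero-padded PartitionKeys) can be checked with a small sketch. It is written in Python rather than .NET, so the tick arithmetic is reimplemented by hand: a tick is 100 nanoseconds counted from 0001-01-01, and `DOTNET_MAX_TICKS` is the value of `DateTime.MaxValue.Ticks`. The function names and the padding width of 10 are assumptions for the example.

```python
from datetime import datetime

DOTNET_EPOCH = datetime(1, 1, 1)          # DateTime.MinValue
DOTNET_MAX_TICKS = 3155378975999999999    # DateTime.MaxValue.Ticks


def to_ticks(dt: datetime) -> int:
    """Convert a naive datetime to .NET-style ticks (100 ns units since 0001-01-01)."""
    delta = dt - DOTNET_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


def row_key_newest_first(dt: datetime) -> str:
    """Reverse-tick RowKey: later timestamps sort lexicographically first."""
    return f"{DOTNET_MAX_TICKS - to_ticks(dt):019d}"


def partition_key(metric_id: int, width: int = 10) -> str:
    """Zero-padded metric id so string ordering matches numeric ordering."""
    return f"{metric_id:0{width}d}"
```

Since table entities are returned in key order, a query without any ordering clause then yields the most recent readings first within each metric's partition.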
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition– a table partition are all of the entities in a
table with the same partition key value, and usually tables have many
partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the
20,000 entities/second, which is the overall account target described
above.
Now you mentioned that you have 10-20 different metric points, and for each metric point you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities / minute / table, which is well under the scalability target of 2,000 entities / second.
Now the question of reading remains. Assume a user would read a maximum of 24 hours' worth of data (i.e. 24 * 60 = 1440 points) per partition. If the user gets the data for all 20 metrics for 1 day, then each user (thus each table) will fetch a maximum of 28,800 data points. The question that is left for you, I guess, is how many requests like this you can get per second to meet that threshold. If you could somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
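The arithmetic above can be written out as a quick back-of-the-envelope script. The constants come from the figures quoted in this answer; the last line is a rough extrapolation that treats the per-partition throughput target as if it applied directly to entities read, which is a simplifying assumption rather than a documented guarantee.

```python
PARTITION_TARGET_PER_SEC = 2_000       # entities/second per partition, as quoted above
METRICS_PER_TENANT = 20                # upper bound from the question
WRITES_PER_METRIC_PER_MIN = 1          # at most one reading per metric per minute
POINTS_PER_METRIC_PER_DAY = 24 * 60    # 1440 readings per metric per day

# Write load for one tenant table, in entities per second.
writes_per_sec_per_table = METRICS_PER_TENANT * WRITES_PER_METRIC_PER_MIN / 60

# Entities returned when one user reads a full day for all metrics.
points_per_full_day_read = METRICS_PER_TENANT * POINTS_PER_METRIC_PER_DAY

# Rough ceiling on full-day reads per second against a single metric partition.
day_reads_per_sec_per_partition = PARTITION_TARGET_PER_SEC / POINTS_PER_METRIC_PER_DAY
```

Under these assumptions writes are a fraction of one entity per second per table, so any bottleneck would come from concurrent full-day reads against the same partition rather than from ingestion.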
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.

HIVE/HDFS for realtime storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements,
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequent)
b. Partial data for a user needs to be accessed periodically (more frequent). E.g. I need sensor data collected over the last hour/day/week/month for analysis/reporting.
I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive in such a use case? I am concerned that while the distributed storage needs would be met, it seems more suited to data warehousing applications than to real-time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
I think HBase can be a good option for you. In fact, there is already an open-source implementation on top of HBase that solves a similar problem and that you might want to use: take a look at OpenTSDB. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale, and
make this data easily accessible and graphable. Thanks to HBase's
scalability, OpenTSDB allows you to collect many thousands of metrics
from thousands of hosts and applications, at a high rate (every few
seconds). OpenTSDB will never delete or downsample data and can easily
store billions of data points. As a matter of fact, StumbleUpon uses
it to keep track of hundred of thousands of time series and collects
over 600 million data points per day in their main production
datacenter.
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
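To put the stated workload in per-second terms, here is a quick calculation from the numbers in the question (1 million endpoints, 100 bytes each, once per minute). It counts raw payload only; replication, per-row storage overhead and compaction are deliberately ignored, so actual disk usage would be higher.

```python
ENDPOINTS = 1_000_000
BYTES_PER_WRITE = 100
WRITES_PER_ENDPOINT_PER_MIN = 1

# Cluster-wide write rate.
writes_per_min = ENDPOINTS * WRITES_PER_ENDPOINT_PER_MIN
writes_per_sec = writes_per_min / 60

# Raw payload volume, before replication and storage overhead.
payload_bytes_per_min = writes_per_min * BYTES_PER_WRITE
payload_bytes_per_day = payload_bytes_per_min * 24 * 60
payload_gb_per_day = payload_bytes_per_day / 1_000_000_000
```

That works out to roughly 16,700 writes per second and about 144 GB of raw payload per day across the whole cluster.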
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply be running several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would look up all of the sensor IDs for a user with one query, and then a second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.
