Clickhouse maximum amount of materialized views per base table - statistics

Is there any limits or recommendation for maximum amount of materialized views per base table?
I aim to generate several statistics reports from a table containing raw data (which contains up to 1B records). Instead of generating a single materialized view for all kinds of reports, I want to create smaller MVs (eg hourly statistics, daily statistics, statistics broken down by location/device/demographic/etc). Right now it seems to be around 12 MVs, but it will grow in future, while adding new reports.
Will CH handle this? Or should I look for different approach to achieve my goal? Unfortunately I couldn’t find anything answering this questions in the docs.

No any rules. Every system is uniq.
I would say that 10 MV is over too many.
But MV slow down inserts. If your ingestion is OK then 12 is OK for you.
hourly statistics, daily statistics, statistics broken down by location/device/demographic/etc)
From my experience I would create 2 MV:
hourly statistics with dimensions: location, device, demographic
weekly statistics with dimensions: location, device, demographic
Also check projections https://www.youtube.com/watch?v=jJ5VuLr2k5k
Maybe you can create projections against your main table. OR maybe it has sense to create projections against MV.

Related

Provisioned write capacity in Cassandra

I need to capture time-series sensor data in Cassandra. The best practices for handling time-series data in DynamoDB is as follow:
Create one table per time period, provisioned with write capacity less than 1,000 write capacity units (WCUs).
Before the end of each time period, prebuild the table for the next period.
As soon as a table is no longer being written to, reduce its provisioned write capacity. Also reduce the provisioned read capacity of earlier tables as they age, and archive or delete the ones whose contents will rarely or never be needed.
Now I am wondering how I can implement the same concept in Cassandra! Is there any way to manually configure write/read capacity in Cassandra as well?
This really depends on your own requirements that you need to discuss with development, etc.
There are several ways to handle time-series data in Cassandra:
Have one table for everything. As Chris mentioned, just include the time component into partition key, like a day, and store data per sensor/day. If the data won't be updated, and you know in advance how long they will be kept, so you can set TTL to data, then you can use TimeWindowCompactionStrategy. Advantage of this approach is that you have only one table and don't need to maintain multiple tables - that's make easier for development and maintenance.
The same approach as you described - create a separate table for period of time, like a month, and write data into them. In this case you can effectively drop the whole table when data "expires". Using this approach you can update data if necessary, and don't require to set TTL on data. But this requires more work for development and ops teams as you need to reach multiple tables. Also, take into account that there are some limits on the number of tables in the cluster - it's recommended not to have more than 200 tables as every table requires a memory to keep metadata, etc. Although, some things, like, a bloom filter, could be tuned to occupy less memory for tables that are rarely read.
For cassandra just make a single table but include some time period in the partition key (so the partitions do not grow indefinitely and get too large). No table maintenance and read/write capacity is really more dependent on workload and schema, size of cluster etc but shouldn't really need to be worried about except for sizing the cluster.

What is the best data model for timeseries in Cassandra when *fast sequential reads* are required

I want to store streaming financial data into Cassandra and read it back fast. I will have up to 20000 instruments ("tickers") each containing up to 3 million 1-minute data points. I have to be able to read large ranges of each of these series as speedily as possible (indeed it is the reason I have moved to a columnar-type database as MongoDB was suffocating on this use case). Sometimes I'll have to read the whole series. Sometimes I'll need less but typically the most recent data first. I also want to keep things really simple.
Is this model, which I picked up in a Datastax tutorial, the most effective? Not everyone seems to agree.
CREATE TABLE minutedata (
ticker text,
time timestamp,
value float,
PRIMARY KEY (ticker, time))
WITH CLUSTERING ORDER BY (time DESC);
I like this because there are up to 20 000 tickers so the partitioning should be efficient, and there are only up to 3 million minutes in a row, and Cassandra can handle up to 2 billion. Also with the time descending order I get most recent data when using a limit on the query.
However, the book Cassandra High Availability by Robbie Strickland mentions the above as an anti-pattern (using sensor-data analogy), and I quote the problems he cites from page 144:
Data will be collected for a given sensor indefinitely, and in many
cases at a very high frequency
With sensorID as the partition key, the row will grow by two
columns for every reading (one marker and one reading).
I understand point one would be a problem but it's not in my case due to the 3 million data point limit. But point 2 is interesting. What are these "markers" between each reading? I clearly want to avoid anything that breaks contiguous data storage.
If point 2 is a problem, what is a better way to model timeseries so that they can efficiently be read in large ranges, fast? I'm not particularly keen to break the timeseries into smaller sub-periods.
If your query pattern was to find a few rows for a ticker using a range query, then I would say having all the data for a ticker in one partition would be a good approach since Cassandra is optimized to access partitions efficiently.
But if everything is in one one partition, then that means the query is happening on only one node. Since you say you often want to read large ranges of rows, then you may want more parallelism.
If you split that same data across many nodes and read it in parallel, you may be able to get better performance. For example, if you partitioned your data by ticker and by year, and you had ten nodes, you could theoretically issue ten async queries and have each year queried in parallel.
Now 3 million rows is a lot, but not really that big, so you'd probably have to run some tests to see which approach was actually faster for your situation.
If you're doing more than just retrieving all these rows and are doing some kind of analytics on them, then parallelism will become more attractive and you might want to look into pairing Cassandra with Spark so that the data and be read and processed in parallel on many nodes.

Cassandra database design - 1000 columns or dynamically created tables

I wanted to hear your advice about a potential solution for an advertise agency database.
We want to build a system that will be able to track users in a way that we know
what they did on the ads, and where.
There are many type of ads, and some of them also FORMS, so user can fill data.
Each form is different but we dont want to create table per form.
We thought of creating a very WIDE table with 1k columns, dozens for each type, and store the data.
In short:
Use Cassandra;
Create daily tables so data will be stored on a daily table;
Each table will have 1000 cols (100 for datetime, 100 for int, etc).
Application logic will map the data into relevant cols so we will be able to search and update those later.
What do you think of this ?
Be careful with generating tables dynamically in Cassandra. You will start to have problems when you have too many tables because there is a per table memory overhead. Per Jonathan Ellis:
Cassandra will reserve a minimum of 1MB for each CF's memtable: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
Even daily tables are not a good idea in Cassandra (tables per form is even worse). I recommend you build a table that can hold all your data and you know will scale well -- verify this with cassandra-stress.
At this point, heed mikea's advice and start thinking about your access patterns (see Patrick's video series), you may have to build additional tables to meet your querying needs.
Note: For anyone wishing for a schemaless option in c*:
https://blog.compose.io/schema-less-is-usually-a-lie/
http://rustyrazorblade.com/2014/07/the-myth-of-schema-less/

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low latency reads are my biggest concern here. From my research, the best way I can think to implement this is a different counter tables for each each permutation of source, user, and predefined time. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern, and it should be if you have already chosen Cassandra, you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for read and don't worry about redundant storage. And since within every table data is stored sequentially according to the index, then you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given in "Cassandra Day" events. You may find it on "Planet Cassandra" or John Haddad's blog.

Design of Partitioning for Azure Table Storage

I have some software which collects data over a large period of time, approx 200 readings per second. It uses an SQL database for this. I am looking to use Azure to move a lot of my old "archived" data to.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding as to how this will scale however; so was hoping somebody might be able to clear this up:
For performance Azure will/might split my table by partitionkey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
Few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or I need to retrieve the data for all metrics for a particular date/time range? If this is the case then you're looking at full table scan. Obviously you could avoid this by doing multiple queries (one query / PartitionKey)
Do I need to see the most latest results first or I don't really care. If it's former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also since PartitionKey is a string value, you may want to convert int value to a string value with some "0" prepadding so that all your ids appear in order otherwise you'll get 1, 10, 11, .., 19, 2, ...etc.
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only and not the RowKey. Within a Partition, RowKey serves as unique key. Windows Azure will try and keep data with the same PartitionKey in the same node but since each node is a physical device (and thus has size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition– a table partition are all of the entities in a
table with the same partition key value, and usually tables have many
partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the
20,000 entities/second, which is the overall account target described
above.
Now you mentioned that you've 10 - 20 different metric points and for for each metric point you'll write a maximum of 1 record per minute that means you would be writing a maximum of 20 entities / minute / table which is well under the scalability target of 2000 entities / second.
Now the question remains of reading. Assuming a user would read a maximum of 24 hours worth of data (i.e. 24 * 60 = 1440 points) per partition. Now assuming that the user gets the data for all 20 metrics for 1 day, then each user (thus each table) will fetch a maximum 28,800 data points. The question that is left for you I guess is how many requests like this you can get per second to meet that threshold. If you could somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.

Resources