Cassandra data model for time series data - cassandra

To monitor some distributed software, I insert its monitoring data into a Cassandra table. The columns are metric_type, metric_value, host_name, component_type and time_stamp. I collect all the metrics for all the nodes every second. The time is uniform across all nodes and their metrics. The keys (which differentiate rows) are host_name, component_type, metric_type and time_stamp. I designed my table as below:
CREATE TABLE metrics (
component_type text,
host_name text,
metric_type text,
time_stamp bigint,
metric_value text,
PRIMARY KEY ((component_type, host_name, metric_type), time_stamp)
) WITH CLUSTERING ORDER BY (time_stamp DESC)
where component_type, host_name and metric_type form the partition key and time_stamp is the clustering key.
The metrics table is suitable for queries that fetch data by timestamp for a given combination of host_name, component_type and metric_type: Cassandra uses the partition key to find the partition where the data is stored and the clustering key to fetch the data from that partition, which is the optimal case for Cassandra queries.
Besides that, I need a query that fetches all data using only time_stamp. For example:
SELECT * from metrics WHERE time_stamp >= 1529632009872 and time_stamp < 1539632009872 ;
I know the metrics table is not optimal for the above query, because it has to search every partition to fetch the data. I guess in this situation we should design another table with time_stamp as the partition key, so the data will be fetched from one or a limited number of partitions. But I am not certain about some aspects:
Is it optimal to set time_stamp as the partition key? I insert data into the database every second, so the number of partition keys will be very large!
I need my queries to use a range on time_stamp, and I know range conditions are not allowed on partition keys, only on clustering keys!
So what is the best Cassandra data model for such time series data and queries?

Using time_stamp as partition key is not optimal in my opinion, as it would create a lot of partitions.
I would propose 2 solutions:
1) Go with a "week_first_day" as partition key. You would have to compute the correct week_first_day keys on your application side and then emit multiple select queries.
2) You could use Elasticsearch on top of Cassandra. Cassandra remains the primary data source, but you have the freedom to do complex selects. If you are interested, I would recommend taking a look at Elassandra.
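As a rough illustration of solution 1, the application-side bucket computation could look like the following Python sketch. The table name metrics_by_week, its key layout, and the choice of Monday (UTC) as the first day of the week are assumptions made for this example, not part of the original answer.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table, assumed for this sketch:
#   PRIMARY KEY ((week_first_day), time_stamp, component_type, host_name, metric_type)
QUERY = (
    "SELECT * FROM metrics_by_week "
    "WHERE week_first_day = ? AND time_stamp >= ? AND time_stamp < ?"
)


def week_first_day(ts_ms):
    """Return the UTC date of the Monday that starts the week containing ts_ms."""
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
    return day - timedelta(days=day.weekday())


def week_buckets(start_ms, end_ms):
    """List every week_first_day bucket overlapping the range [start_ms, end_ms)."""
    bucket = week_first_day(start_ms)
    last = week_first_day(end_ms - 1)
    buckets = []
    while bucket <= last:
        buckets.append(bucket)
        bucket += timedelta(days=7)
    return buckets


def queries_for_range(start_ms, end_ms):
    """Pair the query with its bind values, one SELECT per weekly bucket."""
    return [
        (QUERY, (bucket.isoformat(), start_ms, end_ms))
        for bucket in week_buckets(start_ms, end_ms)
    ]
```

Each (query, values) pair would be executed separately and the results merged on the client. With per-second metrics from many hosts, a weekly partition may still grow large, so a smaller bucket (a day or an hour) might be the better trade-off.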

Related

Select row with highest timestamp

I have a table that stores events
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now, I want to select the event with the highest start_time. Is it possible? I've tried to create a secondary index, but with no success.
This is a query I've created
select * from active_events order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data, you need to read all of it and find the highest value. This requires scanning data on multiple nodes and will take a very long time.
A materialized view also won't help much, as ordering only exists inside an individual partition, so you would need to put all your data into a single partition, which could be huge and would leave the data imbalanced.
I can only think of following workaround:
Have an additional table that has all the columns of the original table, but with a fake partition key and no clustering columns.
Insert into that table in parallel with the normal inserts, using a fixed value for the fake partition key and explicitly setting the write timestamp of the record equal to start_time (don't forget to multiply by 1000, as the write timestamp uses microseconds). In this case the stored row is guaranteed to be the one with the highest timestamp, as Cassandra won't overwrite it with data that has a lower timestamp.
But this doesn't solve the problem of data skew: all traffic will be handled by a fixed number of nodes equal to the RF.
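A minimal Python sketch of this workaround follows. The table name latest_event, the column fake_key and the fixed key value "latest" are hypothetical names chosen for illustration; the statement is meant to be prepared and executed with whatever driver the application already uses.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical table: a fixed partition key and no clustering columns,
# so the partition holds exactly one row.
INSERT_LATEST = (
    "INSERT INTO latest_event (fake_key, event_id, number, start_time) "
    "VALUES (?, ?, ?, ?) USING TIMESTAMP ?"
)


def write_timestamp_micros(start_time):
    """Convert a timezone-aware start_time to microseconds since the epoch,
    the unit used by CQL write timestamps."""
    return (start_time - EPOCH) // timedelta(microseconds=1)


def latest_event_params(event_id, number, start_time):
    """Bind values for INSERT_LATEST; the write timestamp equals start_time."""
    return ("latest", event_id, number, start_time, write_timestamp_micros(start_time))
```

Because every column is written with start_time as its write timestamp, an insert carrying an older start_time loses to the row already stored, and reading the single partition (WHERE fake_key = 'latest') returns the event with the highest start_time.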
Another alternative - use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions each with thousands of rows spread across hundreds of nodes. A full table scan in a large cluster will take a very long time if it was allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!

Cassandra Data modelling with multiple tables for same data

Cassandra Data Modeling Query
Hello,
The data model I am working on is as below, with different tables for the same data set to satisfy different kinds of queries. The data mainly stores event data of some campaigns sent out on multiple channels like email, web, mobile app, SMS etc. Events can include page visits, email opens, link clicks etc. for different subscribers.
Table 1:
(enterprise_id int, domain_id text, campaign_id int, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, ........) (many more columns not part of primary key)
PRIMARY KEY ((enterprise_id, campaign_id), domain_id, event_category, event_action, datetime, subscription_id)
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 1:
I have the partition key as enterprise_id + campaign_id. Each enterprise can have several campaigns. The datastore may have data for a few hundred campaigns. Each campaign can have up to 2-3 million records. Hence there may be 3000 partitions across 100 enterprises, with each partition having 2-3 million records.
Cassandra Queries: Query always with the partition key + primary key including the datetime field. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with similar values for the rest of the keys in the primary key. enterprise_id + campaign_id is always available as a filter in the queries.
Table 2:
(enterprise_id int, domain_id text, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, domain_id, event_category, event_action, datetime, subscription_id)
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 2: I have the partition key as enterprise_id only. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra Queries: Query always with the partition key + primary key up to datetime. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with similar values for the rest of the keys in the primary key. In this case, data has to be queried across campaigns and we may not have campaign_id as a filter in the queries.
Table 3:
(enterprise_id int, subscription_id text, domain_id text, event_category text, event_action text, datetime timestamp, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, subscription_id, domain_id, event_category, event_action, datetime)
CLUSTERING ORDER BY (subscription_id DESC, domain_id DESC, event_category DESC, event_action DESC, datetime DESC)
Keys and Data size for Table 3: I have the partition key as enterprise_id. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra Queries: Query always with partition key + primary key as subscription_id only. Should be able to query directly on enterprise_id + subscription_id.
My Queries:
Size of data on each partition: With Table 2 and Table 3 I may end up with more than 800-900 million rows per partition. As per my reading, it is not OK to have so many entries per partition. How can I achieve my use case in this scenario? Even if I create multiple partitions based on something like a week_number (1-52 in a year), the query will need to go across all partitions and end up using an IN clause with all week numbers, which is as good as scanning all data.
Is it OK to have multiple tables with the same partition key and different primary keys with a changed clustering order? For example, in Table 2 and Table 3 the hash will be on enterprise_id and will lead to the same node. However, only the clustering key order has changed, which will allow me to query directly on the required key. Will the data be in different physical partitions for Table 2 and Table 3 in such a scenario? Or if it maps to the same partition number, how will Cassandra internally distinguish between the two tables?
Is it OK to use ALLOW FILTERING if I specify the partition key? For example, I can avoid the need for creating Table 3 and use Table 2 for queries on subscription_id directly if I use ALLOW FILTERING on Table 2. What will the impact be?
First of all, please only ask one question per question. Given the length and detail required for your answers, this post is unlikely to provide long-term value for future users.
As per my reading it is not ok to have so many entries per partition. How can I achieve my use case in this scenario?
Unfortunately, if partitioning on a time component will not work, then you'll have to find some other column to partition the data by. I've seen rows-per-partition work OK in the range of 20k to 50k. Most of those use cases on the higher end had small partitions. It looks like your model has many columns, so I'd be curious as to the average partition size. Essentially, find a column to partition on which keeps your partition sizes in the 1MB to 10MB range.
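To make the sizing guidance concrete, here is a back-of-the-envelope Python sketch. The average row size of 200 bytes is purely an assumed figure for illustration; the real value has to be measured on the actual table.

```python
def max_rows_per_partition(target_partition_bytes, avg_row_bytes):
    """Rows that fit in a partition of the target size."""
    return target_partition_bytes // avg_row_bytes


def partitions_needed(total_rows, rows_per_partition):
    """Ceiling division: partitions required to hold total_rows."""
    return -(-total_rows // rows_per_partition)


AVG_ROW_BYTES = 200               # assumption, not measured
TARGET_BYTES = 10 * 1000 * 1000   # upper end of the 1MB to 10MB range

rows = max_rows_per_partition(TARGET_BYTES, AVG_ROW_BYTES)
partitions = partitions_needed(900_000_000, rows)
print(rows, partitions)
```

Under these assumptions, a single enterprise with 900 million rows would need to be split across thousands of partitions, which is why enterprise_id alone is too coarse as a partition key.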
Is it ok to have multiple tables with same partition key and different primary keys with Clustering order change?
Yes, this is perfectly fine.
Will the data be in different physical partitions for Table2 and Table3 in such a scenario? Or if it maps to same partition number how will cassandra internally distinguish between the two tables?
The partition key is hashed into a token, a number ranging from -2^63 to +2^63-1. That number is then compared to the token ranges mapped to the nodes, and the query is sent to the node that owns it. So all the partition key does is determine which node is responsible for the data.
The tables have their data files written to different directories, based on table name. So Cassandra distinguishes between the tables by the table name provided in the query. Nothing you need to worry about.
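The lookup described above can be sketched in Python as a toy model. It takes the token as given (the real hash function is not implemented here), ignores replication, and uses made-up node names and token values.

```python
from bisect import bisect_left


def owner_of(token, ring):
    """Return the node owning `token`.

    `ring` is a list of (token, node) pairs sorted by token; each node owns
    the range (previous_token, its_token], and the first node also owns
    everything after the last token (wrap-around).
    """
    tokens = [t for t, _ in ring]
    idx = bisect_left(tokens, token)    # first ring token >= the lookup token
    return ring[idx % len(ring)][1]     # wrap to the first node past the end


# Hypothetical three-node ring with tiny token values for readability.
RING = [(-6000, "node_a"), (0, "node_b"), (6000, "node_c")]
```

Since Table 2 and Table 3 both partition on enterprise_id, the same enterprise_id yields the same token and therefore the same node in both tables; on that node, each table's rows live in that table's own data files.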
Is it ok to use ALLOW FILTERING if I specify the partition key.
I would still recommend against it if you're concerned about performance. But the good thing is that using the ALLOW FILTERING directive while specifying a full partition key will indeed prevent Cassandra from reading multiple nodes to build the result set. So that should be OK. The only drawback here is that Cassandra stores/reads data from disk by the defined CLUSTERING ORDER, and using ALLOW FILTERING obviously complicates that process (forcing random reads vs. sequential reads).

Should every table in Cassandra have a partition key?

I am trying to create a Cassandra table where I store the logs for a shop by timestamp. I also want to create a query which returns the data in descending order with respect to the timestamp. If I make my timestamp the primary key, it will automatically be the partition key, as I don't have any other columns in a composite primary key.
And in Cassandra we can't do ORDER BY on partition keys. Is there any way to make my timestamp the primary key but not the partition key (a Cassandra table without a partition key)?
Thanks in advance.
table creation if required :
CREATE TABLE myCass.logs(timestamp timestamp, logs text, PRIMARY KEY (timestamp));
Since you have the timestamp you know the year, month, day. You could use those as your partition key and have the timestamp as a clustering column. In this way you would satisfy also the need for a partition key, you will have a primary key for the data, you could order by on timestamps and you would evenly spread your data across the cluster.
This way of splitting data is called bucketing. Here is some good reading on this subject - Cassandra Time Series Data Modeling For Massive Scale
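As a hedged sketch of day-level bucketing for the logs table, the following Python snippet shows one possible schema and the matching bucket computation. The table name logs_by_day, the column names day and ts, and the use of UTC days are assumptions for illustration.

```python
from datetime import datetime, timezone

# Hypothetical bucketed schema: the day is the partition key and the
# timestamp is a clustering column, so ORDER BY ts works within a day.
CREATE_LOGS_BY_DAY = """
CREATE TABLE myCass.logs_by_day (
    day date,
    ts timestamp,
    logs text,
    PRIMARY KEY ((day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
"""

SELECT_DAY = "SELECT * FROM myCass.logs_by_day WHERE day = ?"


def day_bucket(ts):
    """Return the UTC calendar day (YYYY-MM-DD) of a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).date().isoformat()
```

Writers compute day_bucket(ts) for every insert, and readers query one day at a time; rows within a day come back newest first because of the clustering order.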

Get first row for each partition key in Cassandra

I am considering Cassandra as an intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of them have some business entity id, timestamp and some value. I need to get only latest value in terms of in-event timestamp for each business key, but events may come unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
id uuid,
time timestamp,
value text,
PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY ( time DESC )
Now if I insert some data in this table I can get latest value for some given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require to issue such query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have an ability to list all available partition keys (by select distinct id from table1). So if I look into storage model of Cassandra, getting first row for each partition key should not be too hard.
Is that supported?
If you're using version 3.6 or later, there is an option on your query named PER PARTITION LIMIT (CASSANDRA-7017) which you can set to 1. This won't auto-complete in cqlsh until 3.10 with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can work with essentially any amount of data: it decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for any partition of the data, then perform a complex operation on each of them. Relational databases allow this, Cassandra does not. All it allows are full table scans (SELECT * from table1), or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

How do secondary indexes work in Cassandra?

Suppose I have a column family:
CREATE TABLE update_audit (
scopeid bigint,
formid bigint,
time timestamp,
record_link_id bigint,
ipaddress text,
user_zuid bigint,
value text,
PRIMARY KEY ((scopeid, formid), time)
) WITH CLUSTERING ORDER BY (time DESC)
With two secondary indexes, where record_link_id is a high-cardinality column:
CREATE INDEX update_audit_id_idx ON update_audit (record_link_id);
CREATE INDEX update_audit_user_zuid_idx ON update_audit (user_zuid);
According to my knowledge Cassandra will create two hidden column families like so:
CREATE TABLE update_audit_id_idx(
record_link_id bigint,
scopeid bigint,
formid bigint,
time timestamp,
PRIMARY KEY ((record_link_id), scopeid, formid, time)
);
CREATE TABLE update_audit_user_zuid_idx(
user_zuid bigint,
scopeid bigint,
formid bigint,
time timestamp,
PRIMARY KEY ((user_zuid), scopeid, formid, time)
);
Cassandra secondary indexes are implemented as local indexes rather than being distributed like normal tables. Each node only stores an index for the data it stores.
Consider the following query:
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How will this query execute 'under the hood' in Cassandra?
How will a high-cardinality column index (record_link_id) affect its performance?
Will Cassandra touch all nodes for the above query? Why?
Which criteria will be executed first, base table partition_key or secondary index partition_key? How will Cassandra intersect these two results?
select * from update_audit where scopeid=35 and formid=78005 and record_link_id=9897;
How the above query will work internally in cassandra?
Essentially, all data for partition scopeid=35 and formid=78005 will be returned, and then filtered by the record_link_id index. It will look for the record_link_id entry for 9897, and attempt to match-up entries that match the rows returned where scopeid=35 and formid=78005. The intersection of the rows for the partition keys and the index keys will be returned.
How high-cardinality column (record_link_id)index will affect the query performance for the above query?
High-cardinality indexes essentially create a row for (almost) each entry in the main table. Performance is affected, because Cassandra is designed to perform sequential reads for query results. An index query essentially forces Cassandra to perform random reads. As cardinality of your indexed value increases, so does the time it takes to find the queried value.
Will Cassandra touch all nodes for the above query? Why?
No. It should only touch a node that is responsible for the scopeid=35 and formid=78005 partition. Indexes are likewise stored locally, and only contain entries that are valid for the local node.
creating index over high-cardinality columns will be the fastest and best data model
The problem here is that this approach does not scale, and will be slow if update_audit is a large dataset. MVP Richard Low has a great article on secondary indexes (The Sweet Spot For Cassandra Secondary Indexing), and particularly on this point:
If your table was significantly larger than memory, a query would be very slow even to return just a few thousand results. Returning potentially millions of users would be disastrous even though it would appear to be an efficient query.
...
In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index.
Now, your approach of first restricting by a specific partition will help (as your partition should certainly fit into memory). But I feel the better-performing choice here would be to make record_link_id a clustering key, instead of relying on a secondary index.
Edit
How does having an index on a low-cardinality column scale when there are millions of users, even when we provide the primary key?
It will depend on how wide your rows are. The tricky thing about extremely low cardinality indexes, is that the % of rows returned is usually greater. For instance, consider a wide-row users table. You restrict by the partition key in your query, but there are still 10,000 rows returned. If your index is on something like gender, your query will have to filter-out about half of those rows, which won't perform well.
Secondary indexes tend to work best on (for lack of a better description) "middle of the road" cardinality. Using the above example of a wide-row users table, an index on country or state should perform much better than an index on gender (assuming that most of those users don't all live in the same country or state).
Edit 20180913
For your answer to 1st question "How the above query will work internally in cassandra?", do you know what's the behavior when query with pagination?
Consider the paging diagram in the Java Driver documentation (v3.6).
Basically, paging will cause the query to break itself up and return to the cluster for the next iteration of results. It'd be less likely to time out, but performance will trend downward, proportional to the size of the total result set and the number of nodes in the cluster.
TL;DR; The more requested results spread over more nodes, the longer it will take.
Query with only secondary index is also possible in Cassandra 2.x
select * from update_audit where record_link_id=9897;
But this has a large impact on fetching data, because it reads all partitions in a distributed environment. The data fetched by this query is also not consistent and cannot be relied on.
Suggestion:
Use of a secondary index is considered to be a DIRT query from the NoSQL data model point of view.
To avoid a secondary index, we could create a new table and copy the data to it. Since this is a query of the application, and tables are derived from queries, it deserves its own table.
