Cassandra data model for application logs (billions of operations!) - cassandra

Say, I want to collect logs from a huge application cluster which produces 1000-5000 records per second. In future this number might reach 100000 records per second, aggregated from a 10000-strong datacenter.
CREATE TABLE operation_log (
-- Seconds will be used as row keys, thus each row will
-- contain 1000-5000 log messages.
time_s bigint,
time_ms int, -- Microseconds (to sort data within one row).
uuid uuid, -- Monotonous UUID (NOT time-based UUID1)
host text,
username text,
accountno bigint,
remoteaddr inet,
op_type text,
-- For future filters — renaming a column must be faster
-- than adding a column?
reserved1 text,
reserved2 text,
reserved3 text,
reserved4 text,
reserved5 text,
-- 16*n bytes of UUIDs of connected messages, usually 0,
-- sometimes up to 100.
submessages blob,
request text,
PRIMARY KEY ((time_s), time_ms, uuid)) -- Partition on time_s
-- Because queries will be "from current time into the past"
WITH CLUSTERING ORDER BY (time_ms DESC)
CREATE INDEX oplog_remoteaddr ON operation_log (remoteaddr);
...
(secondary indices on host, username, accountno, op_type);
...
CREATE TABLE uuid_lookup (
uuid uuid,
time_s bigint,
time_ms int,
PRIMARY KEY (uuid));
I want to use OrderedPartitioner which will spread data all over the cluster by its time_s (seconds). It must also scale to dozens of concurrent data writers as more application log aggregators are added to the application cluster (uniqueness and consistency is guaranteed by the uuid part of the PK).
Analysts will have to look at this data by performing these sorts of queries:
range query over time_s, filtering on any of the data fields (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
pagination query from the results of the previous one (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND token(uuid) < token($uuid) AND $filters),
count messages filtered by any data fields within a time range (SELECT COUNT(*) FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
group all data by any of the data fields within some range (will be performed by application code),
request dozens or hundreds of log messages by their uuid (hundreds of SELECT * FROM uuid_lookup WHERE uuid IN [00000005-3ecd-0c92-fae3-1f48, ...]).
My questions are:
Is this a sane data model?
Is using OrderedPartitioner the way to go here?
Does provisioning a few columns for potential filter make sense? Or is adding a column every once in a while cheap enough to run on a Cassandra cluster with some reserved headroom?
Is there anything that prevents it from scaling to 100000 inserted rows per second from hundreds of aggregators and storing a petabyte or two of queryable data, provided that the number of concurrent queryists will never exceed 10?

This data model is close to a sane model, with several important modifications/caveats:
Do not use ByteOrderedPartitioner, especially not with time as the key. Doing this will result in severe hotspots on your cluster, as you'll do most of your reads and all your writes to only part of the data range (and therefore a small subset of your cluster). Use Murmur3Partitioner.
To enable your range queries, you'll need a sentinel key--a key you can know in advance. For log data, this is probably a time bucket + some other known value that's not time-based (so your writes are evenly distributed).
Your indices might be ok, but it's hard to tell without knowing your data. Make sure your values are low in cardinality, or the index won't scale well.
Make sure any potential filter columns adhere to the low cardinality rule. Better yet, if you don't need real-time queries, use Spark to do your analysis. You should create new columns as needed, as this is not a big deal. Cassandra stores them sparsely. Better yet, if you use Spark, you can store these values in a map.
If you follow these guidelines, you can scale as big as you want. If not, you will have very poor performance and will likely get performance equivalent to a single node.

Related

Cassandra read perfomance slowly decreases over time

We have a Cassandra cluster that consists of six nodes with 4 CPUs and 16 Gb RAM each and underlying shared storage (SSD). I'm aware that shared storage considered a bad practice for Cassandra, but ours is limited at the level of 3 Gb/s on reads and seems to be reliable against exigent disk requirements.
The Cassandra used as an operational database for continuous stream processing.
Initially Cassandra serves requests at ~1,700 rps and it looks nice:
The initial proxyhistograms:
But after a few minutes the perfomance starts to decrease and becomes more than three times worse in the next two hours.
At the same time we observe that the IOWait time increases:
And proxyhistograms shows the following picture:
We can't understand the reasons that lie behind such behaviour. Any assistance is appreciated.
EDITED:
Table definitions:
CREATE TABLE IF NOT EXISTS subject.record(
subject_id UUID,
package_id text,
type text,
status text,
ch text,
creation_ts timestamp,
PRIMARY KEY((subject_id, status), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.c_record(
c_id UUID,
s_id UUID,
creation_ts timestamp,
ch text,
PRIMARY KEY(c_id, creation_ts, s_id)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.s_by_a(
s int,
number text,
hold_number int,
hold_type text,
s_id UUID,
PRIMARY KEY(
(s, number),
hold_type,
hold_number,
s_id
)
);
far from 100 Mb
While some opinions may vary on this, keeping your partitions in the 1MB to 2MB range is optimal. Cassandra typically doesn't perform well when returning large result set. Keeping the partition size small, helps queries perform better.
Without knowing what queries are being run, I can say that with queries which deteriorate over time... time is usually the problem. Take this PRIMARY KEY definition, for example:
PRIMARY KEY((subject_id, status), creation_ts)
This is telling Cassandra to store the data in a partition (hashed from a concatenation of subject_id and status), then to sort and enforce uniqueness by creation_ts. What can happen here, is that there doesn't appear to be an inherent way to limit the size of the partition. As the clustering key is a timestamp, each new entry (to a particular partition) will cause it to get larger and larger over time.
Also, status by definition is temporary and subject to change. For that to happen, partitions would have to be deleted and recreated with every status update. When modeling systems like this, I usually recommend status columns as non-key columns with a secondary index. While secondary indexes in Cassandra aren't a great solution either, it can work if the result set isn't too large.
With cases like this, taking a "bucketing" approach can help. Essentially, pick a time component to partition by, thus ensuring that partitions cannot grow infinitely.
PRIMARY KEY((subject_id, month_bucket), creation_ts)
In this case, the application writes a timestamp (creation_ts) and the current month (month_bucket). This helps ensure that you're never putting more than a single month's worth of data in a single partition.
Now this is just an example. A whole month might be too much, in your case. It may need to be smaller, depending on your requirements. It's not uncommon for time-driven data to be partitioned by week, day, or even hour, depending on the required granularity.

How to search record using ORDER_BY without the partition keys

I'm debugging an issue and the logs should be sitting on a time range between 4/23/19~ 4/25/19
There are hundreds of millions of records on our production.
It's impossible to locate the target records using random sort.
Is there any workaround to search in a time range without partition key?
select * from XXXX.report_summary order by modified_at desc
Schema
...
"modified_at" "TimestampType" "regular"
"record_end_date" "TimestampType" "regular"
"record_entity_type" "UTF8Type" "clustering_key"
"record_frequency" "UTF8Type" "regular"
"record_id" "UUIDType" "partition_key"
First, ORDER BY is really quite superfluous in Cassandra. It can only operate on your clustering columns within a partition, and then only on the exact order of the clustering columns. The reason for this, is that Cassandra reads sequentially from the disk, so it writes all data according to the defined clustering order to begin with.
So IMO, ORDER BY in Cassandra is pretty useless, except for cases where you want to change the sort direction (ascending/descending).
Secondly, due to its distributed nature, you need to take a query-oriented approach to data modeling. In other words, your tables must be designed to support the queries you intend to run. Now you can find ways around this, but then you're basically doing a full table scan on a distributed cluster, which won't end well for anyone.
Therefore, the recommended way to go about that, would be to build a table like this:
CREATE TABLE stackoverflow.report_summary_by_month (
record_id uuid,
record_entity_type text,
modified_at timestamp,
month_bucket bigint,
record_end_date timestamp,
record_frequency text,
PRIMARY KEY (month_bucket, modified_at, record_id)
) WITH CLUSTERING ORDER BY (modified_at DESC, record_id ASC);
Then, this query will work:
SELECT * FROM report_summary_by_month
WHERE month_bucket = 201904
AND modified_at >= '2019-04-23' AND modified_at < '2019-04-26';
The idea here, is that as you care about the order of the results, you need to partition by something else to allow for sorting to work. For this example, I picked month, hence I've "bucketed" your results by month into a partition key called month_bucket. Within each month, I'm clustering on modified_at in DESCending order. This way, the most-recent results are at the "top" of the partition. Then, I threw in record_id as a tie-breaker key to help ensure uniqueness.
If you're still focused on doing this the wrong way:
You can actually run a range query on your current schema. But with "hundreds of millions of records" across several nodes, I don't have high hopes for that to work. But you can do it with the ALLOW FILTERING directive (which you shouldn't ever really use).
SELECT * FROM report_summary
WHERE modified_at >= '2019-04-23'
AND modified_at < '2019-04-26' ALLOW FILTERING;
This approach has the following caveats:
With many records across many nodes, it will likely time out.
Without being able to identify a single partition for this query, a coordinator node will be chosen, and that node has a high chance of becoming overloaded.
As this is pulling rows from multiple partitions, a sort order cannot be enforced.
ALLOW FILTERING makes Cassandra work in ways that it really wasn't designed to, so I would never use that on a production system.
If you really need to run a query like this, I recommend using an in-memory aggregation tool, like Spark.
Also, as the original question was about ORDER BY, I wrote an article a while back which better explains this topic: https://www.datastax.com/dev/blog/we-shall-have-order

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have to following table in Cassandra:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running in to is that I would like to complete the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this is a querying a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify full partition key on queries (ALLOW FILTERING will impact performance badly in most cases).
One way to go could be if you know all secondary_id (you could add a table to track them in necessary) and do the job in your application and query all (id, secondary_id) pairs and process them afterwards. This has the disadvantage of beeing more complex but the advantage that it can be done with async queries and in parallel so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
From the application user end - we query a select by account_id statement.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
account_id text,
time timeuuid,
value text,
PRIMARY KEY (account_id, time) WITH CLUSTERING ORDER BY (time DESC)
)
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
The question is HOW better, if at all, is the second data model over the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time all I need to do is a simple
SELECT * FROM metrics WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL rows that where overwritten the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all I encourage you to examine the official documentation about read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the Merge Cells by Timestamp step in the documentation (again check the flow diagram). Notice, that in each SSTable the number of rows will be one in your first case.
In both of your cases the main driving factor is that how many SSTables do you have to check during read. It's somewhat independent from how many records each SSTable contains.
But on the second case you have much bigger SSTabes which leads to longer SSTable compaction. Also TTL expiration performs additional writes. So first case is somewhat preferable.

Cassandra data modeling - Do I choose hotspots to make the query easier?

Is it ever okay to build a data model that makes the fetch query easier even though it will likely created hotspots within the cluster?
While reading, please keep in mind I am not working with Solr right now and given the frequency this data will be accessed I didn’t think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses the user_id and created_date (day the transaction was created, always set to midnight) as the primary key:
CREATE transactions_by_user_and_day (
user_ id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using the created_date as part of the PK allows users to be spread around the cluster more evenly to prevent hotspots. However, from an access perspective it makes the data access layer do a bit more work that we would like. It ends up having to create an IN statement with all days in the provided range instead of giving a date and greater than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_global_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_id’s for the user and filter on a specific date range within Cassandra. However, this causes a chance of hotspots within the cluster. Users with longevity and/or high volume will create quite a few more columns in the row. We intend on supplying a TTL on the data so anything older than 60 days drops off. Additionally, I’ve analyzed the size of the data and 60 days’ worth of data for our most high volume user is under 2 MB. Doing the math, if we assume that all 40,000 users (this number wont grow significantly) are spread evenly over a 3 node cluster and 2 MB of data per user you end up with a max of just over 26 GB per node ((13333.33*2)/1024). In reality, you aren’t going to end up with 1/3 of your users doing that much volume and you’d have to get really unlucky to have Cassandra, using V-Nodes, put all of those users on a single node. From a resources perspective, I don’t think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Date Model 1:Something else you could do would be to change your data access layer to do a query for each ID individually, instead of using the IN clause. Check out this page to understand why that would be better.
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Data model 2: 26GB of data per node doesn't seem like much, but a 2MB fetch seems a bit large. Of course if this is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2MB, that should be fine.
One other solution would be to use Data Model 2 with Bucketing. This would give you more overhead on writes as you'd have to maintain a bucket lookup table as well though. Let me know if need me to elaborate more on this approach.

Resources