Cassandra: TTL vs dynamic tables vs a large number of deletes

I basically have a data table like this (a partition key id, along with a serialized value serialized_value):
CREATE TABLE keyspace.data (
id bigint,
serialized_value blob,
PRIMARY KEY (id)
) WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'enabled': 'true'}
AND compression = { 'class' : 'LZ4Compressor'};
The use case involves maintaining multiple versions of the data (the serialized_value for a given id).
Every day, I have to send a fresh version of the data into Cassandra. That means about 100 million rows/partitions each time.
Of course, I don't need to keep ALL versions of the data, only the last 4 days (i.e. the four most recent version_ids).
I have identified three solutions:
solution 1: TTL
the idea is to set a TTL at insert time (see the sketch after the pros/cons below). That way, the oldest versions of the data are dropped automatically, without problems related to tombstones.
pros:
no read performance penalty (?)
no problems related to tombstones
cons:
if ingestion fails for several days, I may lose all the data in the Cassandra cluster because of the automatic TTL deletes
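A minimal sketch of the TTL approach against the table above, assuming a 4-day retention (345600 seconds); the values are illustrative:
-- TTL set per insert: the row expires automatically after 4 days.
INSERT INTO keyspace.data (id, serialized_value)
VALUES (42, 0xCAFE)
USING TTL 345600;
-- Alternatively, set it once as a table default so every write inherits it.
ALTER TABLE keyspace.data WITH default_time_to_live = 345600;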
solution 2: dynamic tables
the table creation becomes:
CREATE TABLE keyspace.data_{version_id} (
id bigint,
serialized_value blob,
PRIMARY KEY (id)
) ...;
the table name includes the version_id (see the sketch after the pros/cons below).
pros:
the table (corresponding to a version) is easy to drop
no read performance penalty
no problems related to tombstones
cons:
dynamically adding a table to the cluster may require all the nodes to be up every time
a bit more difficult to handle client-side (queries target a version-specific table name instead of a fixed one)
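A sketch of the per-version lifecycle with dynamic tables (the dates in the table names are illustrative):
-- Create a fresh table for today's version...
CREATE TABLE keyspace.data_20230104 (
id bigint,
serialized_value blob,
PRIMARY KEY (id)
);
-- ...and drop the table that falls out of the 4-day window.
DROP TABLE keyspace.data_20221231;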
solution 3: a large number of deletes
in this case, all the data stays in a single table, and a version_id is added to the primary key.
CREATE TABLE keyspace.data (
version_id int,
id bigint,
serialized_value blob,
PRIMARY KEY ((version_id,id))
) ...;
pros:
only one single table to create and maintain for the entire application lifecycle
cons:
a read performance penalty may occur because of the large number of tombstones
tombstone-related problems, because a large amount of data needs to be deleted in order to purge everything related to old version_ids
the delete will only match the exact partition key, so it will generate partition tombstones and NOT cell tombstones, but I'm still worried about the performance of doing that (see the sketch just below)
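A sketch of the purge in solution 3; each statement removes one whole partition, so it has to be issued once per id of the expired version_id (the values are illustrative):
-- Deleting by the full partition key produces a single partition tombstone.
DELETE FROM keyspace.data WHERE version_id = 3 AND id = 42;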
In your opinion, what is the best way to achieve this? :-)

It would be preferable to cluster your data on a date or timestamp, sorted in reverse order, and still set a TTL. For example:
CREATE TABLE ks.blobs_by_id (
id bigint,
version timestamp,
serialized_value blob,
PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
If you set a default TTL on the table, older versions will expire automatically, so when you retrieve the rows with:
SELECT ... FROM blobs_by_id WHERE id = ? LIMIT 4
only the 4 most recent rows will be returned (in descending order), and you won't be iterating over the expired rows. Cheers!
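A sketch of how the TTL could be attached to that table, assuming a 4-day retention (345600 seconds):
ALTER TABLE ks.blobs_by_id WITH default_time_to_live = 345600; -- 4 days
-- Each daily load then just inserts the new version; older rows expire on their own.
INSERT INTO ks.blobs_by_id (id, version, serialized_value)
VALUES (42, toTimestamp(now()), 0xCAFE);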

Related

Table layout for social app in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I was trying to see if we can avoid data de-normalization using YB's secondary indexes. The primary table is something like this:
CREATE TABLE posts_by_user(
user_id bigint,
post_id bigserial,
group_ids bigint[] null,
tag_ids bigint[] null,
content text null,
....
PRIMARY KEY (user_id, post_id)
)
-- there could be multiple group ids (up to 20) which the user can select to publish his/her post in
-- there could be multiple tag ids (up to 20) which the user can select to publish his/her post with
This structure makes fetching by user_id easy, but suppose I want to fetch by group_id(s) or tag_id(s). Then I would either need to de-normalize into secondary tables using YB transactions, which requires additional application logic and could also lead to performance issues because the data will be written to multiple nodes based on the hash primary keys (group_ids and tag_ids).
Or I could use a secondary index to avoid writing additional logic. I have the following doubts about that:
YB stable version 2.8 does not allow creating a secondary index on array columns using GIN; is there a rough timeline for when this will be available in a stable release?
will this also suffer the same performance issue, since multiple index entries will be updated at the time of the client call on multiple nodes, based on the partition keys group_id(s) or tag_id(s)?
Other ideas for storing the data to enable fast queries by user_id(s), group_id(s), or tag_id(s) in a scalable way are also most welcome.
The problem with the GIN index is that it won't be sorted on disk by the timestamp.
You have to create an index for (user_id, datetime desc).
For groups, you can maintain a separate table with a primary key of (group_id desc, datetime desc, post_id desc), and the same for tags.
On each feed request, you can then make multiple queries for, say, 5 posts per user_id or group_id and merge them in the application layer.
This will be the most efficient since all records will be sorted on-disk and in-memory at write-time.
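A minimal sketch of such a groups table and the per-group feed query, assuming YugabyteDB YSQL syntax; the table and column names are illustrative, and a tags table would look the same:
-- Hypothetical companion table, written alongside posts_by_user at post time.
CREATE TABLE posts_by_group (
group_id bigint,
created_at timestamp,
post_id bigint,
user_id bigint,
content text,
PRIMARY KEY (group_id HASH, created_at DESC, post_id DESC)
);
-- Feed request: latest 5 posts for one group; repeat per group/tag and merge in the app.
SELECT post_id, user_id, content
FROM posts_by_group
WHERE group_id = 42
ORDER BY created_at DESC, post_id DESC
LIMIT 5;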

Cassandra query timeout

We are pulling data from around 20-25 industrial motor sensors, and the data is stored in a Cassandra database. Cassandra is currently running on a single node.
Below is the table structure
CREATE TABLE cisonpremdemo.machine_data (
id uuid PRIMARY KEY,
data_temperature bigint,
data_current bigint,
data_timestamp timestamp,
deviceid text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND default_time_to_live = 7884000
AND gc_grace_seconds = 100;
CREATE INDEX deviceid_idx ON cisonpremdemo.machine_data (deviceid);
CREATE INDEX data_timestamp_idx ON cisonpremdemo.machine_data (data_timestamp);
Data has been collected in this table for a couple of months, at roughly every 5 seconds for almost 24 hours a day, so there is a pretty huge volume of data.
I am trying to execute a date-range query from Java and .NET, and in both cases I get timeout errors (Cassandra failure during read query at consistency LocalOne (0 replica(s) responded over 1 required)).
The query works fine if I give a limit of 100; it fails for anything above that. Some of the things I have tried:
1) increased the query timeout.
2) reduced gc_grace_seconds to 100 (temporarily) to eliminate any tombstones.
Query used:
SELECT data_temperature AS "DATA_TEMP",data_current AS "DATA_CURRENT" FROM machine_data
WHERE DATA_TIMESTAMP>=1517402474699
AND DATA_TIMESTAMP<=1517402774699
AND DEVICEID='BP_100' ALLOW FILTERING;
I'm not sure if the table structure (primary key) is the wrong choice. Should it be both deviceid and timestamp?
The secondary indexes will almost surely fail. Indexed columns should have "not too low, not too high" cardinality (what counts as right depends on the number of nodes in the ring). It's very hard to get right, and you should really just avoid secondary indexes unless you have a strong need and the data fits (e.g. when cross-table consistency is not possible with a denormalized table).
Another thing you should never use is ALLOW FILTERING; it exists pretty much just for debugging/development and for large Spark-style jobs that read the entire dataset. It's horribly expensive and will almost always result in timeouts in the long term.
Instead you should create new tables, and also break them up by time so the partitions do not get too large, e.g.:
CREATE TABLE cisonpremdemo.machine_data_by_time (
id uuid,
data_temperature bigint,
data_current bigint,
data_timestamp timestamp,
yymm text,
deviceid text,
PRIMARY KEY ((deviceid, yymm), data_timestamp)
) WITH CLUSTERING ORDER BY (data_timestamp DESC);
When you insert your data, write to both tables. You should essentially create a table for each kind of query you have, so the data is stored in the format you need it in. Do not model your tables around how the data looks. If you do not need direct lookups by uuid, do not create the machine_data table as you have it above at all, since that is not how you are querying it.
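With that layout, the original date-range query becomes a single-partition read without ALLOW FILTERING (a sketch; the yymm bucket value is an assumption about how the month is encoded):
SELECT data_temperature AS "DATA_TEMP", data_current AS "DATA_CURRENT"
FROM cisonpremdemo.machine_data_by_time
WHERE deviceid = 'BP_100'
AND yymm = '1801'
AND data_timestamp >= 1517402474699
AND data_timestamp <= 1517402774699;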

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key on queries (ALLOW FILTERING will badly impact performance in most cases).
One way to go, if you know all the secondary_ids (you could add a table to track them if necessary), is to do the job in your application: query all the (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
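A minimal sketch of such a tracking table and the per-pair queries (the table name is illustrative):
-- Hypothetical lookup table: one partition per id, listing its secondary_ids.
CREATE TABLE secondary_ids_by_id (
id int,
secondary_id int,
PRIMARY KEY (id, secondary_id)
);
-- Step 1: fetch all secondary_ids for the id of interest.
SELECT secondary_id FROM secondary_ids_by_id WHERE id = ?;
-- Step 2: for each secondary_id, query time_data with its full partition key,
-- ideally as async queries issued in parallel.
SELECT * FROM time_data WHERE id = ? AND secondary_id = ?;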

Primary key cardinality causing Partition Too Large errors?

I'm inserting into Cassandra 3.12 via the Python (DataStax) driver and CQL BatchStatements [1]. With a primary key that results in a small number of partitions (10-20), everything works well, but the data is not uniformly distributed across nodes.
If I include a high-cardinality column, for example time or client IP in addition to date, the batch inserts result in a Partition Too Large error, even though the number of rows and the row length are the same.
Higher cardinality keys should result in more but smaller partitions. How does a key generating more partitions result in this error?
[1] Although everything I have read suggests that batch inserts can be an anti-pattern, with a batch covering only one partition I still see the highest throughput compared to async or concurrent inserts for this case.
CREATE TABLE test
(
date date,
time time,
cid text,
loc text,
src text,
dst text,
size bigint,
s_bytes bigint,
d_bytes bigint,
time_ms bigint,
log text,
PRIMARY KEY ((date, loc, cid), src, time, log)
)
WITH compression = { 'class' : 'LZ4Compressor' }
AND compaction = {'compaction_window_size': '1',
'compaction_window_unit': 'DAYS',
'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'};
I guess you meant Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large errors?
This is because of the parameter batch_size_fail_threshold_in_kb, which defaults to 50 kB of data in a single batch - and there are also earlier warnings at a 5 kB threshold through batch_size_warn_threshold_in_kb in cassandra.yaml (see http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html).
Can you share your data model? Just adding a column doesn't mean the partition key changes - maybe you only changed the primary key by adding a clustering column. Hint: PRIMARY KEY (a,b,c,d) uses only a as the partition key, while PRIMARY KEY ((a,b),c,d) uses a,b as the partition key - an easily overlooked mistake.
Apart from that, the additional column takes some space - so you can easily hit the threshold now; just reduce the batch size so it fits within the limits again. In general it is good practice to batch only upserts that affect a single partition, as you mentioned. Also make use of async queries and make parallel requests to different coordinators to gain some more speed.
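A sketch of what a single-partition unlogged batch against the table above could look like; all rows share the same (date, loc, cid) partition key, the values are illustrative, and the total payload should stay under the batch_size_* thresholds:
BEGIN UNLOGGED BATCH
INSERT INTO test (date, loc, cid, src, time, log, size)
VALUES ('2018-01-31', 'dc1', 'client-a', '10.0.0.1', '12:00:00.000', 'line 1', 512);
INSERT INTO test (date, loc, cid, src, time, log, size)
VALUES ('2018-01-31', 'dc1', 'client-a', '10.0.0.2', '12:00:01.000', 'line 2', 768);
APPLY BATCH;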

Cassandra data model for application logs (billions of operations!)

Say, I want to collect logs from a huge application cluster which produces 1000-5000 records per second. In future this number might reach 100000 records per second, aggregated from a 10000-strong datacenter.
CREATE TABLE operation_log (
-- Seconds will be used as row keys, thus each row will
-- contain 1000-5000 log messages.
time_s bigint,
time_ms int, -- Microseconds (to sort data within one row).
uuid uuid, -- Monotonous UUID (NOT time-based UUID1)
host text,
username text,
accountno bigint,
remoteaddr inet,
op_type text,
-- For future filters — renaming a column must be faster
-- than adding a column?
reserved1 text,
reserved2 text,
reserved3 text,
reserved4 text,
reserved5 text,
-- 16*n bytes of UUIDs of connected messages, usually 0,
-- sometimes up to 100.
submessages blob,
request text,
PRIMARY KEY ((time_s), time_ms, uuid)) -- Partition on time_s
-- Because queries will be "from current time into the past"
WITH CLUSTERING ORDER BY (time_ms DESC);
CREATE INDEX oplog_remoteaddr ON operation_log (remoteaddr);
...
(secondary indices on host, username, accountno, op_type);
...
CREATE TABLE uuid_lookup (
uuid uuid,
time_s bigint,
time_ms int,
PRIMARY KEY (uuid));
I want to use OrderedPartitioner, which will spread the data all over the cluster by its time_s (seconds). It must also scale to dozens of concurrent data writers as more application log aggregators are added to the application cluster (uniqueness and consistency are guaranteed by the uuid part of the PK).
Analysts will have to look at this data by performing these sorts of queries:
range query over time_s, filtering on any of the data fields (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
pagination query from the results of the previous one (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND token(uuid) < token($uuid) AND $filters),
count messages filtered by any data fields within a time range (SELECT COUNT(*) FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
group all data by any of the data fields within some range (will be performed by application code),
request dozens or hundreds of log messages by their uuid (hundreds of SELECT * FROM uuid_lookup WHERE uuid IN [00000005-3ecd-0c92-fae3-1f48, ...]).
My questions are:
Is this a sane data model?
Is using OrderedPartitioner the way to go here?
Does provisioning a few columns for potential filters make sense? Or is adding a column every once in a while cheap enough to run on a Cassandra cluster with some reserved headroom?
Is there anything that prevents it from scaling to 100000 inserted rows per second from hundreds of aggregators and storing a petabyte or two of queryable data, provided that the number of concurrent queriers never exceeds 10?
This data model is close to a sane model, with several important modifications/caveats:
Do not use ByteOrderedPartitioner, especially not with time as the key. Doing this will result in severe hotspots on your cluster, as you'll do most of your reads and all your writes to only part of the data range (and therefore a small subset of your cluster). Use Murmur3Partitioner.
To enable your range queries, you'll need a sentinel key, that is, a key you can know in advance. For log data, this is probably a time bucket plus some other known value that's not time-based (so your writes are evenly distributed); see the sketch after this list.
Your indices might be ok, but it's hard to tell without knowing your data. Make sure your values are low in cardinality, or the index won't scale well.
Make sure any potential filter columns adhere to the low-cardinality rule. Better yet, if you don't need real-time queries, use Spark to do your analysis. You should create new columns as needed, as this is not a big deal; Cassandra stores them sparsely. And if you use Spark, you can store these values in a map.
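A minimal sketch of such a bucketed partition key (the shard column and its values are illustrative assumptions; only a few of the original columns are shown):
-- Partition on the second bucket plus a known, non-time-based shard value,
-- so writes for the same second are spread across the cluster.
CREATE TABLE operation_log_bucketed (
time_s bigint,
shard int, -- e.g. hash(aggregator id) % N, known in advance
time_ms int,
uuid uuid,
host text,
username text,
request text,
PRIMARY KEY ((time_s, shard), time_ms, uuid)
) WITH CLUSTERING ORDER BY (time_ms DESC);
-- A range query then fans out over the known shard values for each bucket.
SELECT * FROM operation_log_bucketed WHERE time_s = ? AND shard = ?;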
If you follow these guidelines, you can scale as big as you want. If not, you will have very poor performance and will likely get performance equivalent to a single node.
