Cassandra hard vs soft delete - cassandra

I have multiple tables whose deleted data I want to keep.
I thought of two options to achieve that:
Create a new table called deleted_x and, when deleting from x, immediately insert the row into deleted_x.
Advantage : querying from only one table.
Disadvantages :
An extra insert for each delete
When the original table structure changes, I will have to change the deleted table too.
Have a column called is_deleted and put it in the partition key in each of these tables and set it to true when deleting a row.
Advantage : One table structure
Disadvantage : having to mention is_deleted in every query against the table
Are there any performance considerations I should think of additionally?
Which way is the better way?

Option #1 is awkward, but it's probably the right way to do things in Cassandra. You could issue the two mutations (one DELETE, and one INSERT) in a single batch, and guarantee that both are written.
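For illustration, a logged batch along these lines (a sketch only - it assumes a hypothetical table x keyed by id and a mirror table deleted_x with the same columns) keeps the archive insert and the delete together:

BEGIN BATCH
    -- copy the row into the archive table (values shown are made up)
    INSERT INTO deleted_x (id, data) VALUES (123, 'payload');
    -- then remove it from the live table
    DELETE FROM x WHERE id = 123;
APPLY BATCH;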
Option #2 isn't really as easy as you may expect if you're coming from a relational background, because adding an is_deleted column to a table in Cassandra and expecting to be able to query against it isn't trivial. The primary reason is that Cassandra performs significantly better when querying against the primary key (partition key(s) plus optional clustering key(s)) than against secondary indexes. Therefore, for maximum performance, you'd need to model this as a clustering key - and doing so prohibits you from simply issuing an update; you'd need to delete + insert anyway.
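To illustrate that last point (a sketch only, with a hypothetical table x keyed by id): once is_deleted is a clustering column it becomes part of the row's identity, so flipping it means writing a new row and deleting the old one rather than issuing an UPDATE:

-- hypothetical table with is_deleted as a clustering column
CREATE TABLE x (
    id int,
    is_deleted boolean,
    data text,
    PRIMARY KEY (id, is_deleted)
);

-- "soft deleting" id = 123 is a delete + insert, not an UPDATE of the flag
BEGIN BATCH
    DELETE FROM x WHERE id = 123 AND is_deleted = false;
    INSERT INTO x (id, is_deleted, data) VALUES (123, true, 'payload');
APPLY BATCH;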
Option #2 becomes somewhat more viable in 3.0+ with Materialized Views - if you're looking at Cassandra 3.0+, it may be worth considering.
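A rough sketch of what that could look like (assuming 3.0+ and a hypothetical base table x_base where is_deleted is a regular column - not a definitive design, and note that view updates are applied asynchronously):

CREATE TABLE x_base (
    id int,
    item int,
    is_deleted boolean,
    data text,
    PRIMARY KEY (id, item)
);

-- Cassandra maintains the view as the base table changes;
-- rows where is_deleted is null do not appear in the view
CREATE MATERIALIZED VIEW x_by_deleted AS
    SELECT * FROM x_base
    WHERE id IS NOT NULL AND item IS NOT NULL AND is_deleted IS NOT NULL
    PRIMARY KEY (id, is_deleted, item);

-- only the non-deleted rows of a partition
SELECT * FROM x_by_deleted WHERE id = 123 AND is_deleted = false;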

Are there any performance considerations I should think of additionally?
You will effectively double the write load and storage size for your cluster by inserting your data twice. This includes compactions, repairs, bootstrapping new nodes and backups.
Which way is the better way?
Let me suggest a 3rd option instead.
Create a table all_data that contains every row and is never deleted from.
Create a table active_data using the same partition key. This table will contain only the keys of non-deleted rows (no payload data at all, just the key!).
Checking whether the key is in active_data before reading from all_data allows you to read only non-deleted rows.
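A minimal sketch of that layout (the column names are made up; key it however your real tables are keyed):

-- keeps every row ever written; never deleted from
CREATE TABLE all_data (
    id int,
    data text,
    PRIMARY KEY (id)
);

-- holds only the keys of non-deleted rows
CREATE TABLE active_data (
    id int,
    PRIMARY KEY (id)
);

-- a "delete" only touches active_data; all_data keeps the row
DELETE FROM active_data WHERE id = 123;

-- read path: check active_data first, then fetch the payload from all_data
SELECT id FROM active_data WHERE id = 123;
SELECT * FROM all_data WHERE id = 123;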

Related

Data model for Scylla/Cassandra when the table partition key is not known beforehand -> static field?

I am using ScyllaDb, but I think this also applies to Cassandra since ScyllaDb is compatible with Cassandra.
I have the following table (I have ~5 tables of this kind):
create table batch_job_conversation (
    conversation_id uuid,
    primary key (conversation_id)
);
This is used by a batch job to make sure some fields are kept in sync. In the application, a lot of concurrent writes/reads can happen. Once in a while, I will correct the values with a batch job.
A lot of writes can happen to the same row, so rows get overwritten many times. A batch job currently picks up rows with this query:
select * from batch_job_conversation
Then the batch job reads the data at that point and makes sure things are in sync. I think this query is bad because it stresses all the partitions and the coordinator node, since it needs to visit ALL partitions.
My question is whether it is better for this kind of table to have a fixed field. Something like this:
create table batch_job_conversation (
    always_zero int,
    conversation_id uuid,
    primary key ((always_zero), conversation_id)
);
And then the query would be this:
select * from batch_job_conversation where always_zero = 0
For each batch job I can use a different partition key. The number of rows in these tables will be roughly the same (a few thousand at most). The same rows will probably be overwritten many times.
Is it better to have a fixed value? Is there another way to handle this? I don't have a logical partition key I can use.
The second model would create a LARGE partition and you don't want that, trust me ;-)
(you would do a partition scan on top of a large partition, which is worse than the original full scan)
(and another piece of advice - keep your partitions small and have a lot of them, then all your CPUs will be used rather equally)
The first approach is OK - it is called a FULL SCAN, BUT
you need to manage it properly.
There are several ways; we blogged about it at https://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and basically it boils down to divide and conquer over the token ring (roughly as sketched below).
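The idea is to split the ring into many small token ranges and scan them one by one, ideally in parallel; the range boundaries below are just made-up example values, see the blog post for how to size them properly:

SELECT * FROM batch_job_conversation
 WHERE token(conversation_id) >= -9223372036854775808
   AND token(conversation_id) <  -9200000000000000000;

SELECT * FROM batch_job_conversation
 WHERE token(conversation_id) >= -9200000000000000000
   AND token(conversation_id) <  -9180000000000000000;

-- ...and so on until the whole ring is covered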
Also note that Spark implements full scans too.
hth
L

Is it possible to insert or update without providing all primary key columns in a Cassandra database?

I have an application which uses Cassandra as its database, and a row of the table is filled at three separate moments (by three inputs). I have four primary key columns in that table, and not all of them are available at the moment an insert or update is issued.
The error is:
"Some partition key parts are missing" when trying to insert or update.
Please consider that my application has to do a lot of writes (near 300,000) to the database in a short interval, so I want to get the maximum write throughput available from the db.
One workaround might be to first read from the db and then write, using dummy values for any primary key column that is not available at the moment of inserting or updating. But that would add another 300,000 reads to the db and slow down the whole process, both for the db and for my application.
So I am looking for another solution.
I have four primary key columns in that table, and not all of them are available at the moment an insert or update is issued.
As you are finding out, that is not possible. For partition keys in particular, they are used (hashed) to determine which node in the cluster is primarily responsible for the data. As it is a fundamental part of the Cassandra write path, it must be complete at write-time and cannot be changed/updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that keys in Cassandra are unique, so any attempt to update them will result in a new row.
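A small illustration of both points (a hypothetical table, just to show the behaviour):

CREATE TABLE t (
    a int,      -- partition key
    b int,      -- clustering key
    val text,
    PRIMARY KEY (a, b)
);

INSERT INTO t (b, val) VALUES (1, 'x');        -- rejected: partition key part a is missing
INSERT INTO t (a, val) VALUES (1, 'x');        -- rejected: clustering key b is missing
INSERT INTO t (a, b, val) VALUES (1, 1, 'x');  -- ok: the full key is supplied
INSERT INTO t (a, b, val) VALUES (1, 2, 'x');  -- not an update of the row above - a brand new row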
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option? Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a Twitter snowflake-generated 64-bit integer. My problem is how I can use WHERE without always providing the id (most of the time I will query using the bigint timestamp). I also encounter this problem in other tables. Because the id is unique, I cannot query a batch of timestamps.
Is it okay if, let's say for a bunch of tables on my single node, I use an id like cluster1, so that when I query by id I just use id = cluster1? But then it loses the uniqueness feature.
ALLOW FILTERING comes up as an option here, but I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a Cassandra-compatible database written in C++.
In Cassandra, as you've probably already read, the queries drive the tables, not the other way around. So your situation, where you want to query by a different filter, would ideally entail creating another Cassandra table. That's the optimal way. Partition keys are required in filters unless you provide the ALLOW FILTERING "switch", which isn't recommended as it will perform a DC-wide (possibly cluster-wide) search, and you're still subject to timeouts.
You could consider using indexes or materialized views, which are basically Cassandra-maintained tables populated by the base table's changes. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but with either of these components there can be side effects just like with any other Cassandra table (inconsistencies, latencies, additional rules, etc.).
I would say do a bit of research to determine the best approach, but most likely ALLOW FILTERING isn't the best choice (especially for high-volume, frequent queries or for tables containing a lot of data). You could also investigate Solr if that's an option, depending on what you're filtering.
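For instance, a denormalized query table keyed by a coarse time bucket is one common way to support "give me a time range without knowing the id". A sketch only - day_bucket and the bucket granularity are made up and would need to be chosen so partitions stay a reasonable size:

CREATE TABLE intraday_history_by_day (
    day_bucket int,              -- e.g. 20230615; one day per partition
    timestamp_seconds bigint,
    sec_code text,
    id bigint,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((day_bucket), timestamp_seconds, sec_code, id)
);

-- range query without the snowflake id
SELECT * FROM intraday_history_by_day
 WHERE day_bucket = 20230615
   AND timestamp_seconds >= 1686787200
   AND timestamp_seconds <  1686790800;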
Hope that helps.
-Jim

Regarding Cassandra's (sloppy, still confusing) documentation on keys, partitions

I have a high-write table I'm moving from Oracle to Cassandra. In Oracle the PK is (clientId int, id UUID). There are about 10 billion rows. Right off the bat I run into this nonsensical warning:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html :
"If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index."
Not only does this seem to defeat efficient find-by-PK, it fails to define what it means to "query between the fields" and what the difference is between a built-in index, a secondary index, and the primary key + clustering sub-clauses in a CREATE TABLE command. A junk description. This is 2019. Shouldn't this be fixed by now?
AFAIK it's misleading anyway:
CREATE TABLE dev.record (
    clientid int,
    id uuid,
    version int,
    payload text,
    PRIMARY KEY (clientid, id, version)
) WITH CLUSTERING ORDER BY (id ASC, version DESC);

INSERT INTO record (id, version, clientid, payload)
VALUES (d5ca94dd-1001-4c51-9854-554256a5b9f9, 3, 1001, '');
INSERT INTO record (id, version, clientid, payload)
VALUES (d5ca94dd-1002-4c51-9854-554256a5b9e5, 0, 1002, '');
The token on clientid indeed shows they're in different partitions as expected.
Turning to the big point: if one were looking for a single row given the clientId and UUID ---AND--- Cassandra allowed you to skip specifying the clientId, so that it wouldn't know which node(s) to search, then sure, that lookup could be slow. But Cassandra doesn't allow that:
select * from record where id=d5ca94dd-1002-4c51-9854-554256a5b9e5;

InvalidRequest: "... despite the performance unpredictability, use ALLOW FILTERING"
And ditto for other variations that exclude clientid. So shouldn't we conclude that Cassandra handles high-cardinality table searches that return "very few results" just fine?
Anything that requires reading the entire contents of the database won't work, which is the case when scanning on id, since any of your clientid partitions may contain one. Walking through potentially thousands of sstables per host, and through each partition of each of those, to check will not work. If you're having a hard time with the data model and don't fully get the difference between partition keys and clustering keys, I would recommend you walk through some introductory classes (i.e. DataStax Academy), YouTube videos, a book, etc. before designing your schema. This is not a relational database, and designing around your data instead of your queries will get you into trouble. When moving from Oracle you should not just copy your tables over and move the data, or it will not work as well.
The clustering key determines the order in which the data within a partition is stored on disk, which is what the docs are referring to as a "built-in index". Each sstable has an index component that contains the partition key locations for that sstable. This also includes an index of the clustering keys for each partition, every 64kb by default, that can be searched on. The clustering keys that exist between each of these indexed points are unknown, so they all have to be checked. A long time ago a bloom filter of clustering keys was kept as well, but the use cases where it helped were so rare compared to its overhead that it was removed in 2.0.
Secondary indexes are difficult to scale well, which is where the warning about cardinality comes from. I would strongly recommend just denormalizing the data and not using an index in any form, as large scatter-gather queries across a distributed system are going to have availability and performance issues. If you really need it, check out http://www.doanduyhai.com/blog/?p=13191 to try to get the data model right (not worth it in my opinion).
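For completeness, the denormalization being recommended would look roughly like this (a sketch; record_by_id is a made-up name, and the application would write both tables, e.g. in a logged batch):

CREATE TABLE dev.record_by_id (
    id uuid,
    version int,
    clientid int,
    payload text,
    PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);

-- now a lookup by id alone is a single-partition read instead of a cluster-wide scan
SELECT * FROM dev.record_by_id WHERE id = d5ca94dd-1002-4c51-9854-554256a5b9e5;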

Supporting logical delete for an existing feed table

I would like to implement logical delete for a news-feed record to support a later undo.
The system is in production, so any solution should support existing data.
Inserting records into the feed is idempotent, thus inserting an already-deleted record (one with the same primary key) should not undelete it.
Any solution should support the queries to retrieve a page of existing or deleted records.
The feed table:
CREATE TABLE my_feed (
    tenant_id int,
    item_id int,
    created_at timestamp,
    feed_data text,
    PRIMARY KEY (tenant_id, created_at, item_id)
) WITH compression = { 'sstable_compression' : 'LZ4Compressor' }
  AND CLUSTERING ORDER BY (created_at DESC);
There are two approaches I have thought of but both have serious disadvantages:
1. Move deleted records to a different table. Queries are trivial and no migration is required, but idempotent inserts seem difficult (only by reading before inserting?).
2. Add an is_deleted column and create a secondary index on it to support the queries. Idempotent inserts seem easier to support (lightweight transactions or an update trick).
The main disadvantage is that older records have a null value for it, so this requires a data migration.
Is there a third more elegant approach? Do you support one of the above suggestions?
If you maintain a separate table for deleted records, you can use CQL's BATCH construct to perform your "move" operation, but since the only record of deletion is in that table, you must check it first if you want the behavior you've described around not re-animating deleted records. Reading before writing is usually an anti-pattern, etc.
Using an is_deleted column might require some migration work, as you mention, but the potentially more serious problem you may have is that creating an index on a very low-cardinality column is usually extremely inefficient. With a boolean field, I think your index would contain only two rows. If you don't delete too frequently, that means your "false" row will be very wide and therefore almost useless.
If you avoid creating a secondary index for the is_deleted column and you allow both null and false to indicate active records, while only explicit true indicates deleted ones, you may not need to migrate anything. (Do you actually know which existing records to delete during migration?) You would then leave filtering deleted records to the client, who is probably already going to be in charge of some of your paging behavior. The drawback of this design is that you may have to ask for > N records to get N that aren't deleted!
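Concretely, that variant could be as small as this (a sketch; the literal key values are made up, and there is deliberately no secondary index):

-- existing rows keep a null value, which the client treats as "not deleted"
ALTER TABLE my_feed ADD is_deleted boolean;

-- soft delete a single feed item
UPDATE my_feed SET is_deleted = true
 WHERE tenant_id = 42
   AND created_at = '2016-01-01 00:00:00+0000'
   AND item_id = 7;

-- page read: fetch as usual and drop rows with is_deleted = true on the client
SELECT * FROM my_feed WHERE tenant_id = 42 LIMIT 50;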
I hope that helps and addresses the question as you've stated it. I would be curious to know why you would need to guard against already deleted records being brought back to life, but I can imagine a situation where you have multiple actors working on a particular feed (and the CAS problems that could arise).
On a somewhat unrelated note, you may want to consider using timeuuid instead of timestamp for your created_at field. CQL supports a dateOf() function to retrieve that date if that's a stumbling block. (It may also be impossible to get collisions within your tenant_id partitions, in which case you can safely ignore me.)
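A sketch of that last suggestion (my_feed_v2 is a hypothetical name; created_at becomes a timeuuid and dateOf() recovers the wall-clock time on read):

CREATE TABLE my_feed_v2 (
    tenant_id int,
    item_id int,
    created_at timeuuid,
    feed_data text,
    PRIMARY KEY (tenant_id, created_at, item_id)
) WITH CLUSTERING ORDER BY (created_at DESC);

INSERT INTO my_feed_v2 (tenant_id, item_id, created_at, feed_data)
VALUES (42, 7, now(), 'hello');

SELECT tenant_id, item_id, dateOf(created_at) AS created_at, feed_data
  FROM my_feed_v2
 WHERE tenant_id = 42;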
