Cassandra delete/update a row and get its previous value - cassandra

How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch, but that does not seem to be allowed:
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items, and I've built several tables that allow me to look up items based on that information.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might end up containing wrong data if I delete an item while someone, for example, hits the like button of that same item at the same time:
1. The deleteItem method executes a SELECT query to fetch the current row of the item from the main table.
2. The likeItem method, running at the same time, executes an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement was executed, and the UPDATE query runs before the DELETE query.
3. The deleteItem method deletes the item from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. This data does not yet contain the just-added like. The item is therefore deleted, but the just-added like remains in the liked_items_by_user table.

You can do a select beforehand, then do a lightweight transaction on the delete to ensure that the data still looks exactly like it did when you selected. If it does, you know the latest state before you deleted. If it does not, keep retrying the whole procedure until it sticks.
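A minimal sketch of that select-then-conditionally-delete loop, assuming data_by_user has a regular column named data (hypothetical) that captures the state you care about:

SELECT user, data FROM data_by_user WHERE user = 'foo';
-- suppose the SELECT returned data = 'old_value'
DELETE FROM data_by_user WHERE user = 'foo' IF data = 'old_value';
-- the conditional DELETE returns [applied] = true only if the row still
-- matched; on false it also returns the current values, so you re-run the
-- SELECT (or use the returned values) and retry until it applies

Conditional deletes are lightweight transactions, so each attempt pays the Paxos round-trip cost.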

Unfortunately, you cannot run a SELECT query inside a batch statement. If you read the docs here, only INSERT, UPDATE, and DELETE statements can be used.
What you're looking for is atomicity on the execution, but batch statements are not going to be the way forward. If the data has been altered, your worst case situation is zombies, or data that could reappear.
Cassandra uses a grace period mechanism to deal with this; you can find the details here. If, for whatever reason, this is critical to your business logic, the "best" thing you can do in this situation is to increase the consistency level, or restructure the read pattern at application level so it does not rely on perfect atomicity, whichever is the right trade-off for you. So either you give up some of the performance, or you tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can do an ALL, and you pay the performance penalty, but that means all replicas for the given foo partition key will have to acknowledge the write both in the commitlog and the memtable. Note, this still means a flush from the commitlog will need to happen before the delete is complete, but you can tune the consistency to the level you require.
You don't have atomicity in the SQL sense, but depending on throughput it's unlikely that you will need it (touch wood).
TLDR:
CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing now comes from the ANTLR3 grammar parser for CQL 3, which is not designed to accept SELECT queries inside batches, simply because they are not supported; you can see that here.

Related

Prevent race condition while writing to Cassandra

I have a realtime streaming solution with Kafka, Spark (as the aggregation engine) and Cassandra (as the store). Users define the aggregates that are needed, and the engine creates the aggregates and writes them to the store. Here is an example of how the aggregates are created:
CREATE AGGR COUNT FROM input_data WHERE type,event,id
This creates a count aggregate for the 3 columns and writes to C*.
We have a requirement to process historical data as well. That means if an aggregate was created today, we need to go back and fix history for it. To cater to this use case, we have created an hvalue column in Cassandra. Here is the schema for reference:
CREATE TABLE tbl (
key blob,
key2 blob,
key3 blob,
...
key15 blob,
column1 blob,
column2 blob,
...
column20 blob,
hvalue blob,
value blob,
PRIMARY KEY ((key, key2, key3 ... key15), column1 ... column20)
) WITH CLUSTERING ORDER BY (column1 ASC,column2 ASC .. column20 ASC)
value stores the facts that are computed during online processing. hvalue stores the value for historical processing. While querying, both columns are retrieved, merged, and returned to the user.
We are using datastax leftJoin API to join with Cassandra.
RDD.leftJoinWithCassandraTable(keyspace, tableName)
  .on(SomeColumns(...))
  .map { case (ip, row) => row match {
      case None => ip
      case Some(data) => CASSANDRA_MAP_SCHEMA(...)
    }
  }
  .saveToCassandra(keyspace, tableName)
In short, we create a schema for the RDD, and write the row to Cassandra.
Now, here is the problem. During the historical process, we need to create a row to write to Cassandra. This means that we need to provide some data to the "value" column. If it is a new row that is not present in Cassandra, we create a null object and write back. If the row is present, we take the existing value and write it back.
The online and historical processes will run at the same time. This means that when the historical process reads a row and writes it back, the online process may have created the same row. This will result in corrupt data, since the historical process may read stale data and overwrite the value that was written by the online process.
I am not sure how to resolve this problem. I'd appreciate any other solutions to prevent this.
I tried to explain this the best I can; let me know if further clarification is needed and I'll try to add more detail.
Thanks in advance for the help.
There are a few ways to work around this, but none are really simple. Fundamentally, write-after-write problems are hard.
The first is to introduce a shared external locking mechanism where you obtain a lock for the row and either release it when you are done or give it a short TTL. You can use something like Redis for this.
A second option is to funnel all changes to Cassandra through a Kafka queue so that only one source is allowed to write, though there is a chance this will make your problem worse. If you are going to do this, make sure that you partition your queue based on keys so that the same key always routes to the same queue.
A third option is that the services are only allowed to operate on data for a given time range. If your online process is only allowed to work on data from the last day (or X hours, etc.), and your historical process is only allowed to work on data older than that, then there is virtually no chance of running into conflicts.
The fourth option is to accept that it is a possibility and that the possibility of it happening is small enough that it isn't an issue. If the datacenter where your code runs is very close (ideally colocated with your db) and you aren't doing significant processing on the row between read and write this may be a reasonable option.

Does a secondary index lock anything when it is being created?

Given the following table schema:
CREATE TABLE Record (
-- uuidv4
recordId STRING(36) NOT NULL,
-- uuidv4
userId STRING(36),
isActive BOOL,
lastUpdate TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true)
...
) PRIMARY KEY (recordId)
CREATE NULL_FILTERED INDEX RecordByUser
ON Record (userId, isActive)
For every record created we make an entry in the index, to be able to get all of a user's records by their userId. Depending on what may be needed, there could be an extra STORING clause with additional information columns.
My understanding is that as I add records to the Record table, Spanner will trigger a write to the index. Since the index is non-interleaved the data itself may have a different locality to the original record.
Under that assumption, will that write to the secondary index lock the Record table until it is completed or does one not affect the other?
I'm going to guess they are totally independent since an index can be created after the fact and Spanner will trigger a backfill operation that does not affect the operational status of the Record table.
The act of writing to the index has to take some resources from the node(s), though, so I would imagine that is really the limitation. Under a high-write scenario for the Record table, we would also effectively be invoking a second write for the index table RecordByUser, consuming a bit more of the node(s)' write throughput capacity.
So the act of adding to a Secondary Index doesn't require any locking on the source table (Record in this case). The primary concern would be the write throughput and any hotspots from those writes. For example, if we indexed on a timestamp as the first part of the index, the writes to the index would bunch up. Is my understanding here correct?
During the act of creating the index on an existing table, does the backfill process hold an exclusive lock on the index, like Postgres for example:
https://www.postgresql.org/docs/current/index-locking.html
Or can new writes land in the index while the backfill is taking place?
I can imagine a backfill process on Spanner's end of things that takes a read snapshot and starts writing. Given Spanner's fancy clocks, if it encounters a row in the index newer than the row it is attempting to write, it just drops the old row on the floor and carries on.
Thanks for the question. Google engineer here to help.
+1 to chainicko's answer for the general locking mechanism. The original table is not "locked" in that sense: you can still read and write it while the backfill is running.
Reads and queries against the index itself are not allowed during the backfill, but writes to the original table are. New writes are added to the index concurrently. After the backfill, Spanner will make sure only the latest data is presented when queried.
As for the example of "indexed on a timestamp as the first part of the index": since it creates a hotspot on the index, it would still have a negative impact on the system as a whole, even though it does not lock the original table.

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters they can specify is to get only the most recent row, in which case I append "LIMIT 1" to the end of the CQL statement, since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device IDs and get back the latest entry for each. So, my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of parameters for it; under the hood it makes requests to multiple partitions anyway and puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify a limit, it applies to the whole statement; basically, you can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in IN would become one query) and put a LIMIT 1 on every one of them, as sketched below.
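A minimal sketch of the per-device approach, reusing the table and column names from the question (the device IDs are placeholders):

-- one query per device id, each limited to the newest row
SELECT * FROM data
WHERE application_id = ? AND partner_id = ? AND location_id = ?
AND device_id = 'device-1' AND data_schema = ?
LIMIT 1;

SELECT * FROM data
WHERE application_id = ? AND partner_id = ? AND location_id = ?
AND device_id = 'device-2' AND data_schema = ?
LIMIT 1;

Because of CLUSTERING ORDER BY (activity_timestamp DESC), each partition is read newest-first, so LIMIT 1 returns the latest entry for that device. The queries can be issued concurrently from the client, for example as async executions of the same prepared statement.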
To be honest, this was my solution in a lot of projects and it works pretty well. Basically, with IN the coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the results for you, might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
This is all in case you can't afford more disk space for your cluster.
Usual Cassandra solution
Data in Cassandra is suggested to be modeled for the queries you need (query-first design). So basically you would have one additional table with the same partition key as you have now, but with the clustering column activity_timestamp dropped, i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
The double parentheses (()) are intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can run the query you need with IN against this table, and since it contains only the latest entry per partition key, you don't need LIMIT 1. That would be the usual solution in Cassandra; a sketch follows.
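A minimal sketch of that second table and the dual write, assuming a table named latest_entry, a single illustrative payload column, and guessed column types (the real table would carry whatever columns data has):

CREATE TABLE latest_entry (
    application_id text,
    partner_id text,
    location_id text,
    device_id text,
    data_schema text,
    activity_timestamp timestamp,
    payload text,
    PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
);

-- on every write, upsert both tables; the newest write simply overwrites
-- the single row per key in latest_entry
BEGIN BATCH
INSERT INTO data (application_id, partner_id, location_id, device_id, data_schema, activity_timestamp, payload) VALUES (?, ?, ?, ?, ?, ?, ?);
INSERT INTO latest_entry (application_id, partner_id, location_id, device_id, data_schema, activity_timestamp, payload) VALUES (?, ?, ?, ?, ?, ?, ?);
APPLY BATCH;

-- the read side can then use IN and skip LIMIT 1
SELECT * FROM latest_entry
WHERE application_id = ? AND partner_id = ? AND location_id = ?
AND device_id IN ('device-1', 'device-2') AND data_schema = ?;

(On older Cassandra versions IN is only accepted on the last partition key column, so you may need to order the key with device_id last or fall back to per-device queries.)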
If you are afraid of the additional writes, don't worry , they are inexpensive and cpu bound. With cassandra it's always "bring on the writes" I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. Indeed, it is only supported on the last field of the partition key or the last field of the clustering key. So you can:
swap your two last fields of the primary key
use one query for each device id

MemSQL - why can't I do a cross-database insert into .. select

I'm trying to do a simple insert with a field list from a table in one database to a table in another.
insert into db_a.target_table (field1,field2,field3) select field1,field2,field3 from db_b.source_table;
The error message seems straightforward:
MemSQL does not support this type of query: Cross-database INSERT ... SELECT
Oddly enough, this example does work:
insert into db_a.target_table select * from db_b.source_table;
But this seems like such a common scenario. Has anyone run into a similar issue, and were you able to work around it?
Unfortunately, this isn't allowed because it is difficult to keep such queries transactional; multi-statement transactions are used internally to guarantee transactionality of the single insert-select (if one partition fails (dup key or something), we want to roll back everything!). Since we don't have cross-db multi-statement transactions (yet!), we don't have cross-db insert-select (yet!).
Stay tuned for nicer solutions.
However, if you REAAALY want to do this, here is what you do. However,
PROCEED AT YOUR OWN RISK. THIS IS NOT A SUPPORTED PROCEDURE.
But it should work.
1) On db_b, create a table with the same columns as source_table, but make the shard key SHARD().
2) On db_a, run SHOW PARTITIONS.
3) For each of those partitions, create a connection to db_a_<ordinal> on the host and port listed in SHOW PARTITIONS. Run SHOW DATABASES on that connection and you'll see some databases called db_b_<another>. Pick one, it doesn't matter which. Run INSERT INTO db_b_<another>.source_table SELECT * from db_a_<ordinal>.source_table.
3.5) At this point, you haven't yet written to a table you care about, but now we will. Look at db_b.source_table. Is everything correct? Is all the data there? Run SHOW CREATE TABLE and double check the shard key is SHARD KEY () (it should be in comments). Everything look good? Ok, we can proceed.
4) After you're done doing this for EVERY partition, you can do INSERT INTO db_b.target_table (cols) SELECT cols from db_b.source_table, or whatever you want.
Good luck!

Supporting logical delete for an existing feed table

I would like to implement logical delete for a news-feed record to support a later undo.
The system is in production, so any solution should support existing data.
Inserting records into the feed is idempotent; thus inserting an already deleted record (one with the same primary key) should not undelete it.
Any solution should support the queries to retrieve a page of existing or deleted records.
The feed table:
CREATE TABLE my_feed (
tenant_id int,
item_id int,
created_at timestamp,
feed_data text,
PRIMARY KEY (tenant_id, created_at, item_id) )
WITH compression = { 'sstable_compression' : 'LZ4Compressor' }
AND CLUSTERING ORDER BY (created_at DESC);
There are two approaches I have thought of but both have serious disadvantages:
1. Move deleted records to a different table. Queries are trivial and no migration is required, but idempotent inserts seem to be difficult (only by reading before insert?).
2. Add an is_deleted column. Create a secondary index for that column to support the queries. Idempotent inserts seem to be easier to support (lightweight transactions or an update trick).
The main disadvantage is that older records have a null value, so this requires data migration.
Is there a third, more elegant approach? Do you endorse one of the above suggestions?
If you maintain a separate table for deleted records, you can use CQL's BATCH construct to perform your "move" operation, but since the only record of deletion is in that table, you must check it first if you want the behavior you've described around not re-animating deleted records. Reading before writing is usually an anti-pattern, etc.
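A minimal sketch of that batched "move", assuming a deleted_feed table that mirrors my_feed (the table name is illustrative; column names are taken from the question):

CREATE TABLE deleted_feed (
    tenant_id int,
    item_id int,
    created_at timestamp,
    feed_data text,
    PRIMARY KEY (tenant_id, created_at, item_id) )
    WITH CLUSTERING ORDER BY (created_at DESC);

BEGIN BATCH
INSERT INTO deleted_feed (tenant_id, item_id, created_at, feed_data) VALUES (?, ?, ?, ?);
DELETE FROM my_feed WHERE tenant_id = ? AND created_at = ? AND item_id = ?;
APPLY BATCH;

The batch keeps the two mutations together, but to honor "inserting an already deleted record should not undelete it" the writer still has to check deleted_feed before inserting into my_feed, which is exactly the read-before-write the answer cautions against.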
Using an is_deleted column might require some migration work, as you mention, but the potentially more serious problem you may have is that creating an index on a very low-cardinality column is usually extremely inefficient. With a boolean field, I think your index would contain only two rows. If you don't delete too frequently, that means your "false" row will be very wide and therefore almost useless.
If you avoid creating a secondary index for the is_deleted column and you allow both null and false to indicate active records, while only explicit true indicates deleted ones, you may not need to migrate anything. (Do you actually know which existing records to delete during migration?) You would then leave filtering deleted records to the client, who is probably already going to be in charge of some of your paging behavior. The drawback of this design is that you may have to ask for > N records to get N that aren't deleted!
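A minimal sketch of that variant, assuming is_deleted is simply added to the existing table and never indexed (only the column name comes from the question):

ALTER TABLE my_feed ADD is_deleted boolean;

-- logical delete: mark the row instead of removing it
UPDATE my_feed SET is_deleted = true
WHERE tenant_id = ? AND created_at = ? AND item_id = ?;

-- readers page through the partition as before and filter client-side,
-- treating both null and false as "active"
SELECT item_id, created_at, feed_data, is_deleted
FROM my_feed WHERE tenant_id = ?;

Existing rows keep a null is_deleted and are treated as active, which is what lets you skip the data migration, at the cost of possibly having to ask for more than N rows to end up with N active ones.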
I hope that helps and addresses the question as you've stated it. I would be curious to know why you would need to guard against already deleted records being brought back to life, but I can imagine a situation where you have multiple actors working on a particular feed (and the CAS problems that could arise).
On a somewhat unrelated note, you may want to consider using timeuuid instead of timestamp for your created_at field. CQL supports a dateOf() function to retrieve that date if that's a stumbling block. (It may also be impossible to get collisions within your tenant_id partitions, in which case you can safely ignore me.)
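A minimal sketch of the timeuuid variant; the my_feed_v2 name is illustrative, and a new table is shown because the type of an existing clustering column can't be altered in place:

CREATE TABLE my_feed_v2 (
    tenant_id int,
    item_id int,
    created_at timeuuid,
    feed_data text,
    PRIMARY KEY (tenant_id, created_at, item_id) )
    WITH CLUSTERING ORDER BY (created_at DESC);

-- now() produces a unique timeuuid per insert, so two events written in the
-- same millisecond for a tenant no longer collide on the clustering key
INSERT INTO my_feed_v2 (tenant_id, item_id, created_at, feed_data)
VALUES (?, ?, now(), ?);

-- dateOf() (toTimestamp() in newer versions) recovers the wall-clock time
SELECT item_id, dateOf(created_at) AS created_at, feed_data
FROM my_feed_v2 WHERE tenant_id = ?;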
