Partition DELETE/INSERT concurrency issue in Cassandra

I have a table in Cassandra which stores versions of csv-files. It uses a primary key with a unique id for the version (the partition key) and a row number (the clustering key). When I insert a new version I first execute a delete statement on the partition key I am about to insert, to clean up any incomplete data. Then the data is inserted.
Now here is the issue. Even though the delete and the subsequent insert are executed synchronously, one after the other, in the application, it seems that some level of concurrency still exists in Cassandra, because when I read afterwards, rows from my insert are occasionally missing (something like 1 in 3 times). Here are some facts:
Cassandra 3.0
Consistency ALL (R+W)
Delete using the Java Driver
Insert using the Spark-Cassandra connector
Number of nodes: 2
Replication factor: 2
The delete statement I execute looks like this:
"DELETE FROM myTable WHERE version = 'id'"
If I omit it, the problem goes away. If I insert a delay between the delete and the insert, the problem is reduced (fewer rows missing). Initially I used a less restrictive consistency level, and I was sure this was the issue, but changing it didn't affect the problem. My hypothesis is that for some reason the delete statement is being sent to the replica asynchronously despite the consistency level of ALL, but I can't see why this would be the case or how to avoid it.

By default, every mutation gets its write time from the coordinator for that write. From the docs:
TIMESTAMP: sets the timestamp for the operation. If not specified,
the coordinator will use the current time (in microseconds) at the
start of statement execution as the timestamp. This is usually a
suitable default.
http://cassandra.apache.org/doc/cql3/CQL.html
Since different mutations can have different coordinators, clock skew between coordinators can cause the timestamps of mutations handled by one machine to be skewed relative to those handled by another.
Since write time controls C* history, this means a driver can insert and delete synchronously and yet, depending on the coordinator, the delete can happen "before" the insert.
Example
Imagine two nodes A and B, B is operating with a 5 second clock skew behind A.
At time 0: You insert data to the cluster and A is chosen as the coordinator. The mutation arrives at A and A assigns a timestamp (0)
There is now a record in the cluster
INSERT VALUE AT TIME 0
Both nodes contain this message and the request returns confirming the write was successful.
At time 2: You issue a delete for the data previously inserted and B is chosen as the coordinator. B assigns a timestamp of (-3) because it is clock skewed 5 seconds behind the time in A. This means that we end up with a statement like
DELETE VALUE AT TIME -3
We acknowledge that all nodes have received this record.
Now the global consistent timeline is
DELETE VALUE AT TIME -3
INSERT VALUE AT TIME 0
Since the insertion occurs after the delete the value still exists.
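The timeline above can be reproduced with a small last-write-wins simulation. This is only a sketch in plain Python, not driver code: Mutation, resolve and coordinator_timestamp are hypothetical names, the timestamps are in seconds to match the example (Cassandra itself uses microseconds), and it assumes a delete wins when timestamps tie.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mutation:
    kind: str                    # "insert" or "delete"
    timestamp: int               # write time assigned by the coordinator
    value: Optional[str] = None  # payload, only meaningful for inserts


def coordinator_timestamp(true_time: int, skew: int) -> int:
    """Timestamp a coordinator assigns when its clock is off by `skew`."""
    return true_time + skew


def resolve(mutations):
    """Return the surviving value under last-write-wins, or None if deleted.

    Arrival order is ignored on purpose: only the timestamp decides.
    Assumption: on a timestamp tie the delete wins.
    """
    winner = max(mutations, key=lambda m: (m.timestamp, m.kind == "delete"))
    return winner.value if winner.kind == "insert" else None


# The example above: insert via A at time 0, delete via B (5 s behind) at time 2.
insert = Mutation("insert", coordinator_timestamp(0, skew=0), "row")
delete = Mutation("delete", coordinator_timestamp(2, skew=-5))
example_result = resolve([insert, delete])  # the insert survives

# The question's case: delete first, then insert via the lagging coordinator.
delete_first = Mutation("delete", coordinator_timestamp(0, skew=0))
insert_second = Mutation("insert", coordinator_timestamp(2, skew=-5), "row")
question_result = resolve([delete_first, insert_second])  # the row is shadowed
```

In the second scenario the insert is timestamped before the delete, so the delete shadows it even though the application issued it first. That would match rows going missing after a delete followed by an insert.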

I had a similar problem, and I fixed it by using lightweight transactions (LWT) for both INSERT and DELETE requests (for all queries actually, including UPDATE). This makes sure all queries to this partition are serialized through one "thread", so the DELETE won't overwrite the INSERT. For example (assuming instance_id is the primary key):
INSERT INTO myTable (instance_id, instance_version, data) VALUES ('myinstance', 0, 'some-data') IF NOT EXISTS;
UPDATE myTable SET instance_version=1, data='some-updated-data' WHERE instance_id='myinstance' IF instance_version=0;
UPDATE myTable SET instance_version=2, data='again-some-updated-data' WHERE instance_id='myinstance' IF instance_version=1;
DELETE FROM myTable WHERE instance_id='myinstance' IF instance_version=2;
// or:
DELETE FROM myTable WHERE instance_id='myinstance' IF EXISTS;
IF clauses enable lightweight transactions for each row, so all of them are serialized. Warning: LWTs are more expensive than normal calls, but sometimes they are needed, as in the case of this concurrency problem.

Related

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doing the extraction we iterate over the table 10 times for every username to make sure we try all values of entry_type.
We can no longer find any entries as we have depleted our list of usernames. But our nodetool tablestats report that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out what usernames remains in the table. If I could inspect it I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table as such:
SELECT * FROM ${keyspace}.entries LIMIT 1
as Cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?

As per the comment, the migration process includes a DELETE operation on the Cassandra table, but the engine delays the actual removal of the affected records from disk; this process is controlled internally with tombstones and the table's gc_grace_seconds attribute. The reason for this delay is fully explained in this blog entry. TL;DR: if the default value is still in place, at least 10 days (864,000 seconds) must pass after the delete is executed before the data is actually removed.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table. The example below sets it to 1 minute, while the default is 864,000 seconds (10 days):
ALTER TABLE ${keyspace}.entries WITH gc_grace_seconds = 60;
Manually compact the table:
nodetool compact ${keyspace} entries
Once the process has completed, nodetool tablestats should be up to date.
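To see why lowering gc_grace_seconds matters before compacting, here is a minimal sketch of the time gate only. The function names purgeable_from and is_purgeable are made up for illustration, and the sketch ignores everything else compaction takes into account (for example which SSTables take part in it).

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864_000  # 10 days, the default mentioned above


def purgeable_from(deleted_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Earliest moment at which a tombstone written at `deleted_at` may be dropped."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def is_purgeable(deleted_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once the grace period has fully elapsed."""
    return now >= purgeable_from(deleted_at, gc_grace_seconds)


deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
one_hour_later = deleted_at + timedelta(hours=1)

# With the default, a compaction one hour after the delete cannot drop the tombstone.
with_default = is_purgeable(deleted_at, one_hour_later)
# With gc_grace_seconds = 60, the same compaction is allowed to drop it.
with_one_minute = is_purgeable(deleted_at, one_hour_later, gc_grace_seconds=60)
```

Remember to restore gc_grace_seconds afterwards if the table stays in use, since the grace period exists to give all replicas time to learn about the delete.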

To answer your first question, I would like to put more light on gc_grace_seconds property.
In Cassandra, data isn't deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A "tombstone" marker is written to indicate that the data is now (logically) deleted (also known as a soft delete). Tombstoned records must be removed to reclaim the storage space, which is done by a process called compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog post to read for more detail: https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
You are possibly looking at the table size before gc_grace_seconds has elapsed, so the data is still there.
Coming to your second issue, where you want to fetch some samples from the table without providing partition keys: you can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow these articles / documentation to write a quick Spark application to analyze Cassandra data.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
I would recommend not deleting records while you do the migration. Rather, complete the migration first and then do a quick validation / verification to ensure all records were migrated successfully (you can easily do this using Spark by comparing dataframes from the old and new tables). After successful verification, truncate the old table, as truncate does not create tombstones and is hence more efficient. Note that a huge number of tombstones is not good for cluster health.

Cassandra delete/update a row and get its previous value

How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch but that seems to be not allowed.
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items, and I've built several tables that allow looking up items based on that information.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might contain wrong data if I delete an item and at the same time someone e.g. hits the like button of that same item.
The deleteItem method executes a SELECT query to fetch the current row of the item from the main table.
The likeItem method, which gets executed at the same time, runs an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement was executed, and the UPDATE query is executed before the DELETE query.
The deleteItem method deletes the items from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. This data does not yet contain the just added like. The item is therefore deleted, but the just added like remains in the liked_items_by_user table.

You can do a select beforehand, then do a lightweight transaction on the delete to ensure that the data still looks exactly like it did when you selected. If it does, you know the latest state before you deleted. If it does not, keep retrying the whole procedure until it sticks.
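The retry loop could look roughly like the sketch below. It runs against an in-memory stand-in rather than a real cluster: ItemStore and delete_and_return_previous are hypothetical names, and delete_if_equals only mimics what a conditional DELETE ... IF <column> = <value> does (apply the delete only when the row still matches, and report whether it was applied).

```python
import threading


class ItemStore:
    """In-memory stand-in for a table that supports conditional deletes."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def select(self, key):
        """Return a copy of the row, or None if it does not exist."""
        with self._lock:
            row = self._rows.get(key)
            return dict(row) if row is not None else None

    def upsert(self, key, row):
        with self._lock:
            self._rows[key] = dict(row)

    def delete_if_equals(self, key, expected):
        """Delete the row only if it still equals `expected`; report success."""
        with self._lock:
            if self._rows.get(key) != expected:
                return False
            del self._rows[key]
            return True


def delete_and_return_previous(store, key, max_attempts=10):
    """Read the row, then delete it only if it is unchanged; retry on conflict."""
    for _ in range(max_attempts):
        snapshot = store.select(key)
        if snapshot is None:
            return None  # nothing left to delete
        if store.delete_if_equals(key, snapshot):
            return snapshot  # this is the state right before the delete
        # The row changed between the read and the delete: read again.
    raise RuntimeError(f"gave up deleting {key!r} after {max_attempts} attempts")
```

Usage: store.upsert("item1", {"owner": "alice"}) followed by delete_and_return_previous(store, "item1") returns {"owner": "alice"} and leaves the row deleted. With a real driver, the condition would have to be spelled out per column in the IF clause, and the outcome read from the result of the conditional statement.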

Unfortunately you cannot do a SELECT query inside a batch statement. If you read the docs here, only insert, update, and delete statements can be used.
What you're looking for is atomicity on the execution, but batch statements are not going to be the way forward. If the data has been altered, your worst case situation is zombies, or data that could reappear.
Cassandra uses a grace period mechanism to deal with this; you can find the details here. If for whatever reason this is critical to your business logic, the "best" thing you can do in this situation is to increase the consistency level, or restructure the read pattern at the application level so it does not rely on perfect atomicity, whichever is the right trade-off for you. So either you give up some of the performance, or you tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can do an ALL, and you pay the performance penalty, but that means all replicas for the given foo partition key will have to acknowledge the write both in the commitlog and the memtable. Note, this still means a flush from the commitlog will need to happen before the delete is complete, but you can tune the consistency to the level you require.
You don't have atomicity in the SQL sense, but depending on throughput it's unlikely that you will need it (touch wood).
TLDR:
CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing now comes from the ANTLR3 grammar parser for CQL 3, which is not designed to accept SELECT queries inside batches simply because they are not supported; you can see that here.

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
From the application user end - we query a select by account_id statement.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
Thought about something like this:
CREATE TABLE current_data_2 (
account_id text,
time timeuuid,
value text,
PRIMARY KEY (account_id, time)
) WITH CLUSTERING ORDER BY (time DESC);
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
The question is how much better, if at all, the second data model is than the first one. From what I understand, the main advantage will be in the reads: since the data is ordered by time, all I need to do is a simple
SELECT * FROM current_data_2 WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL rows that were overwritten under the same key and then chooses the last one by its write timestamp (please correct me if I'm wrong).
Thanks.

First of all I encourage you to examine the official documentation about read path.
data is ordered by time
This is only true in your second case, when Cassandra reads a single SSTable and MemTable (check the flow diagram).
Cassandra actually reads ALL rows that where overwritten the same key
and then chooses the last one by its write timestamp
This happens at the Merge Cells by Timestamp step in the documentation (again, check the flow diagram). Notice that in each SSTable the number of rows will be one in your first case.
In both of your cases the main driving factor is how many SSTables you have to check during a read. It's somewhat independent of how many records each SSTable contains.
But in the second case you have much bigger SSTables, which leads to longer SSTable compaction. Also, TTL expiration performs additional writes. So the first case is somewhat preferable.
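As a rough illustration of the merge step for the first data model, the sketch below treats the memtable and each SSTable as a plain dict holding at most one (write_timestamp, value) pair per account_id. read_latest and the sample data are made up, and real reads involve much more (bloom filters, caches, tombstones).

```python
def read_latest(account_id, sources):
    """Merge one row across several sources and keep the newest write.

    `sources` is an iterable of dicts mapping
    account_id -> (write_timestamp, value).
    """
    candidates = [src[account_id] for src in sources if account_id in src]
    if not candidates:
        return None
    # Every source holding the key has to be consulted; the newest write wins.
    newest_timestamp, newest_value = max(candidates, key=lambda c: c[0])
    return newest_value


# Each overwrite of acc1 landed in a different place before compaction.
sstable_old = {"acc1": (10, "v1")}
sstable_new = {"acc1": (20, "v2"), "acc2": (5, "x")}
memtable = {"acc1": (30, "v3")}

latest = read_latest("acc1", [sstable_old, sstable_new, memtable])
```

The work here grows with the number of sources that contain the key, not with how many rows each source holds, which is the point made above.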

Postgresql parallel bulk INSERT with worker don't parallelize

My scenario:
10 workers
Database has set 100 max connections
Every worker has its own DB connection (max. 10 connections)
Every worker starts a transaction (BEGIN; COMMIT;)
Every worker inserts data in the same table with bulk insert inside the transaction
Data to insert e.g. 1 million rows
Every worker handles 1000 rows (batches of size 1000)
The query of every worker:
BEGIN;
INSERT INTO "test_tbl" ("id",...) VALUES
(...),(...),...[1000 entries]... RETURNING id;
COMMIT;
Table test_tbl has only constraint PRIMARY KEY (id) with index CREATE UNIQUE INDEX formulas_pkey ON formulas USING btree (id)
Problem
After many hours of analysis, it seems that each worker waits until another worker has finished its insert. Why can't the workers insert new data into the same table at the same time?
UPDATE
I have removed all constraints and all indices (primary keys, foreign keys, etc.) but still the same problem. No parallelization.

The fact that there is a primary key means that the database has to check that the values of the corresponding column(s) are UNIQUE and NOT NULL. A second transaction that starts inserting data cannot proceed until the first one has finished inserting (otherwise, there could be non-unique values).
If you don't do the whole bulk insert in one transaction per worker (but in, let's say, batches of 100 inserts), it will work much faster. You will need more calls between client and database (you will have n calls with 100 rows of data, instead of 1 very big call with n*100 rows), but the database will be able to commit much earlier.
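Splitting the rows into smaller batches on the client side could be sketched like this. chunked is a hypothetical helper, and the batch size of 100 is just the figure used above; each yielded batch would then be sent as its own INSERT and committed on its own.

```python
from itertools import islice


def chunked(rows, size):
    """Yield consecutive lists of at most `size` rows from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 1000 rows become 10 batches of 100, each small enough to commit quickly.
batches = list(chunked(range(1000), 100))
```

Because chunked works on any iterable, the rows do not all have to be held in memory before batching starts.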
In PostgreSQL:
reading never blocks writing and writing never blocks reading
... but transaction 1 writing can (and often will) block transaction 2 also writing.
In case you cannot do batch inserts, you can try deferring the PRIMARY KEY constraint check to the end of the transaction. This is done by defining your PRIMARY KEY constraint as DEFERRABLE INITIALLY DEFERRED (which is not the default for PostgreSQL, although it is the SQL standard). See the documentation for "create table":
DEFERRABLE
NOT DEFERRABLE
This controls whether the constraint can be deferred. A constraint that is not deferrable will be checked immediately after every command. Checking of constraints that are deferrable can be postponed until the end of the transaction (using the SET CONSTRAINTS command). NOT DEFERRABLE is the default. Currently, only UNIQUE, PRIMARY KEY, EXCLUDE, and REFERENCES (foreign key) constraints accept this clause.

Cassandra - order of consistency

I know that in Cassandra, there's no strong consistency unless you explicitly request it (and even then, there're no transactions).
However, I'm interested in the "order" of consistency. Take the following example:
In a database cluster, there are 3 nodes (A, B and C). Two insert queries are sent through the same CQL connection (or Thrift for that matter; I don't think that's relevant to this question anyway). Both operate on different tables (this might be relevant).
INSERT INTO table_a (id) VALUES (0)
INSERT INTO table_b (id) VALUES (1)
Directly after the queries have been successfully executed on the node that they're sent to, it goes down. The node may or may not have succeeded in propagating these two queries to B and C.
Now, I'd think that there is an order of consistency. Either both are successfully propagated and executed on B and C, or only the first query is, or neither is. I'd think that under no circumstances is only the second query propagated and executed, and not the first (because of the order of TCP packets, and the fact that obviously, all nodes share the same consistency strategy).
Am I right?

You're right, at least on the node you connect to. What happens on the server is, for a consistency level ONE write:
1. Receive insert to table_a
2. Write into commitlog
3. Acknowledge write to client
4. Receive insert to table_b
5. Write into commitlog
6. Acknowledge write to client
The key is that there is a global commitlog. So you can't flush it for one table and not another. Also, because the writes are sequential, you know the write was made to the commitlog before returning.
The commitlog gets flushed periodically (by default), so it could flush after 2 but before 5, in which case only the insert to table_a is kept in the event of a crash immediately after 4 or 5.
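The ordering argument can be checked with a toy model of a single shared commitlog. This is a simplification, not Cassandra's implementation: CommitLog is a hypothetical class, entries are appended in arrival order, and a crash loses everything appended since the last flush.

```python
class CommitLog:
    """Toy model of one commitlog shared by all tables on a node."""

    def __init__(self):
        self._buffer = []   # appended, but not yet flushed to disk
        self._durable = []  # flushed, survives a crash

    def append(self, entry):
        self._buffer.append(entry)

    def flush(self):
        """Periodic flush: everything appended so far becomes durable, in order."""
        self._durable.extend(self._buffer)
        self._buffer.clear()

    def crash_and_recover(self):
        """Unflushed entries are lost; return what can be replayed."""
        self._buffer.clear()
        return list(self._durable)


log = CommitLog()
log.append("insert table_a")  # write into commitlog for table_a
log.flush()                   # the periodic flush happens to run here
log.append("insert table_b")  # write into commitlog for table_b
recovered = log.crash_and_recover()
```

Whatever the flush timing, what survives is always a prefix of the append order, so in this model the table_b insert can never survive without the table_a insert on the node that received them.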
On other nodes, the ordering isn't guaranteed, because the write is done asynchronously and writes are multithreaded. But it's not possible to totally lose the first write and not the second if the original node doesn't fail permanently.
If you want stronger guarantees, you can use Cassandra's batching.
Cassandra can guarantee that either both writes succeed or neither does if you write them as a batch. Even in old Cassandra versions, if updates within a batch share the same row key (partition key in CQL speak), they are committed to the commitlog atomically, even if they are in different column families (tables).
New in 1.2 is a batchlog across multiple rows that offers the same guarantees - either all the batch gets applied or none.
