We have a table where we need to process a group of deletes atomically (spread across partition keys). Would it be preferable to do this with a LOGGED BATCH of separate DELETE statements, or with one DELETE with the keys to be deleted given in a WHERE ... IN (...) clause?
Neither solution (BATCH or IN) is optimal: depending on your queries, both move load from your clients to the coordinator nodes. If possible, use parallel queries on the client side to distribute the load across nodes.
Now, some insights on the question regarding BATCH and IN:
In both cases you want to limit the number of statements per group (roughly 20).
If you want to group deletes, try to delete the largest possible block in one go, such as a whole partition (to create partition-level tombstones).
A logged batch will retry failed statements until the batch timeout, so it could create more load on the coordinator nodes than IN. Atomicity is only ensured at the partition level; for cross-partition deletes, use IN.
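The "limit the group size" advice can be sketched in JavaScript as follows. This is only an illustration under assumptions: the table name ks.tbl, the column pk and both helper names are hypothetical, and the group size of 20 is just the rough figure mentioned above.

```javascript
// Hypothetical helper: split an array into chunks of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one parameterised DELETE ... IN statement per chunk of keys.
// `ks.tbl` and `pk` are placeholder names, not taken from the question.
function buildInDeletes(keys, groupSize = 20) {
  return chunk(keys, groupSize).map((group) => ({
    query: `DELETE FROM ks.tbl WHERE pk IN (${group.map(() => '?').join(', ')})`,
    params: group,
  }));
}
```

Each returned { query, params } pair could then be executed as its own request, so no single coordinator has to handle the full key list.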
Looking to reclaim space on a large table. The table has old data which is no longer required and can be deleted. The deletes are based on partition key, there are about 500k partition keys to be deleted.
Would it be better to run the deletes in batches, say 50k or 100k in one go? What might be a better batch size (batch here meaning how many deletes are run in one go)?
If the deletes are run from cqlsh, will cqlsh act as a client and connect to different nodes as coordinator for each delete, or will the node from which cqlsh was started act as coordinator for all the deletes fired from there?
What are the best practices for running massive deletes/cleanups? Any specific dos and don'ts?
The first thing to remember in Cassandra is that deletes actually increase disk consumption rather than decrease it, because each delete writes a tombstone; space is only reclaimed once compaction happens and the old data is purged. The Last Pickle has a great blog post on that topic.
Regarding your questions:
Batches on different partition keys put heavy pressure on the coordinator node, so they aren't recommended, especially batches that big. Prefer to delete one by one.
cqlsh always sends commands to the same host (this is enforced by WhiteListRoundRobinPolicy), which acts as coordinator and then forwards the traffic to the nodes owning the data.
I would recommend using an external tool: either Spark with the Spark Cassandra Connector, or DSBulk, which can perform deletes as well via a custom query. Something like this (assuming you have a CSV file with all the values of the partition column(s) that you want to delete; :pk is the name of the column in the header of the CSV file, and pk is the name of the partition column in your schema):
dsbulk load -query "DELETE FROM ks.table WHERE pk = :pk"
In this case DSBulk will send the requests directly to the nodes owning the data, avoiding pressure on a single coordinator node.
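If you would rather script the "delete one by one" approach yourself, a rough JavaScript sketch could look like the one below. It is not DSBulk's implementation, just an illustration: the function name, ks.tbl, pk and the concurrency value are assumptions, and execute is injected so the sketch does not depend on a particular driver. With the DataStax Node.js driver it could be (q, p) => client.execute(q, p, { prepare: true }), in which case a token-aware load balancing policy can route each prepared delete to a replica.

```javascript
// Run one single-partition DELETE per key, keeping at most `concurrency`
// requests in flight. `execute(query, params)` must return a Promise.
// `ks.tbl` and `pk` are placeholder names.
async function deleteOneByOne(execute, keys, concurrency = 32) {
  const query = 'DELETE FROM ks.tbl WHERE pk = ?';
  const failed = [];
  let next = 0;

  // Each worker pulls the next unprocessed key until none are left.
  async function worker() {
    while (next < keys.length) {
      const key = keys[next++];
      try {
        await execute(query, [key]);
      } catch (err) {
        failed.push({ key, err }); // record the failure and keep going
      }
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, keys.length) },
    () => worker()
  );
  await Promise.all(workers);
  return failed; // keys whose delete may need a retry
}
```

Capping the number of in-flight requests keeps the cluster from being flooded while still spreading the deletes over many coordinators.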
I am using the DataStax Node.js driver for Cassandra, and I want to avoid the very frequent I/O operations that inserts will cause in my application. I will be doing around 1000 inserts per second and want to group them all together and perform 1 I/O, instead of running individual queries, which would cause 1000 I/Os. I came across batch statements like the one below:
const query1 = 'UPDATE user_profiles SET email = ? WHERE key = ?';
const query2 = 'INSERT INTO user_track (key, text, date) VALUES (?, ?, ?)';
const queries = [
{ query: query1, params: [emailAddress, 'hendrix'] },
{ query: query2, params: ['hendrix', 'Changed email', new Date()] }
];
client.batch(queries, { prepare: true }, function (err) {
// All queries have been executed successfully
// Or none of the changes have been applied, check err
});
The problem here is that they are atomic. I want the other statements to succeed even if one of them fails. Is there something I can do to achieve that?
Batch statements across multiple partitions (which is the case with your write statements) use a LOGGED batch by default. This is what gives you the atomicity property. If you really want to remove the atomicity, you should use an UNLOGGED batch. You should be aware, however, that an UNLOGGED batch across multiple partitions is an anti-pattern (see https://issues.apache.org/jira/browse/CASSANDRA-9282). Let me try to explain:
When using a batch statement, two questions give you 4 possible cases:
Is your batch against a single partition, or multiple partitions (which is your case)?
Is your batch LOGGED or UNLOGGED? LOGGED ensures atomicity (either all operations succeed or none do). LOGGED batches are more costly.
Let's consider the 4 options:
single partition, LOGGED batch. You use this when you want atomic writes against a single partition. This atomicity has a cost, so use it only if you need it.
single partition, UNLOGGED batch. You use this when you don't need atomicity; it is faster. If your application is correctly configured (token-aware), your batch statement will choose a replica for this partition as coordinator, and you will get a performance boost. That's the only legitimate reason to use an UNLOGGED batch. By default, a batch against a single partition is UNLOGGED.
multiple partitions, LOGGED batch. The only reason to batch queries hitting different partitions is to ensure atomicity. By default, a batch against multiple partitions is LOGGED.
multiple partitions, UNLOGGED batch. This is an anti-pattern because it brings no functional value (no atomicity) and no performance benefit (multiple partitions are involved, so the coordinator has the overhead of contacting the replicas responsible for each partition, which is extra work).
To make it more concrete: when you issue what you call a 'single IO' batch statement across multiple partitions, the coordinator has to slice your 'single IO' into 1000 IOs anyway (this wouldn't be the case if all the writes were on the same partition) and coordinate them across multiple replicas.
To conclude, you might observe a perf improvement on your client side, but you will induce a much larger cost at the Cassandra side.
You might want to read the following blog post: http://batey.info/cassandra-anti-pattern-misuse-of.html
and in particular the section commenting on the use of UNLOGGED batches against multiple partitions:
What this is actually doing is putting a huge amount of pressure on a single coordinator. This is because the coordinator needs to forward each individual insert to the correct replicas. You're losing all the benefit of token aware load balancing policy as you're inserting different partitions in a single round trip to the database.
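As an alternative to a multi-partition batch, the statements can be sent individually and their outcomes inspected one by one. The sketch below is a minimal illustration, not a driver feature: executeIndependently is a hypothetical name, and execute is injected so the example stays self-contained. With the driver from the question it could be (q, p) => client.execute(q, p, { prepare: true }).

```javascript
// Execute every statement on its own so that one failure does not affect
// the others. `execute(query, params)` must return a Promise.
async function executeIndependently(execute, queries) {
  const settled = await Promise.allSettled(
    queries.map(({ query, params }) => execute(query, params))
  );
  // Pair each failed statement with its error so the caller can retry or log.
  return settled
    .map((outcome, i) => ({ outcome, statement: queries[i] }))
    .filter(({ outcome }) => outcome.status === 'rejected')
    .map(({ outcome, statement }) => ({ statement, error: outcome.reason }));
}
```

It takes the same [{ query, params }] array that client.batch accepts, and returns only the statements that failed. For large arrays you would probably also want to cap the number of requests in flight.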
When a new member joins a cluster, table repartitioning and data merge will happen.
If the data is large, I believe it will take some time. While it is happening, what is the state of the cache like?
If I am using embedded mode, does it block my application until the merge is completed? Or, if I don't want to work with an incomplete cache, do I need to wait (somehow) before starting my application operations?
Partition migration will start as soon as the member joins the cluster. It will not block your application because it will progress asynchronously in the background.
Only mutating operations that fall into a migrating partition are blocked. Read-only operations are not blocked.
Mutating operations will get a PartitionMigrationException, which is a RetryableHazelcastException, so they will be retried for 2 minutes by default. Smaller partitions migrate faster, and you can make each partition smaller by increasing the partition count via the system property hazelcast.partition.count.
If you want to block your application until all migrations finish, you can check the isClusterSafe method to make sure there are no migrating partitions in the cluster. But beware that isClusterSafe returns the status of the cluster rather than of the current member, so it might not be something to rely on. Instead, I would recommend not blocking the application while partitions are migrating.
I get a bulk write request for, let's say, some 20 keys from a client.
I can either write them to C* in one batch, or write them individually in an async way and wait on the futures for completion.
Writing in a batch does not seem to be a good option according to the documentation, as my insertion rate will be high, and if the keys belong to different partitions the coordinators will have to do extra work.
Is there a way in the DataStax Java driver to group keys
that belong to the same partition, club them into small
batches, and then write each unlogged batch individually in async? That
way I make fewer RPC calls to the server, and at the same time the coordinator
only has to write locally. I will be using the token-aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so that some of the coordination happens on the driver side.
Then you could group your requests by equality of partition key; that would probably be enough, depending on your workload.
What I mean by 'grouping by equality of partition key' is this: say you have some data that looks like
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then, when inserting several such objects, you group them by MyData.partitioningKey. That is, for each existing partitioningKey value, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
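The grouping step described above can be sketched as follows. The sketch is in JavaScript rather than Java to stay self-contained; the same idea maps onto one BatchStatement per group in the Java driver. The function names, the my_data table and its column names are hypothetical.

```javascript
// Group rows by their partition key. `partitioningKey` follows the MyData
// shape above; for a composite partition key the caller would pass a
// `keyOf` that combines all key columns into one value.
function groupByPartitionKey(rows, keyOf = (row) => row.partitioningKey) {
  const groups = new Map();
  for (const row of rows) {
    const key = keyOf(row);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  }
  return groups;
}

// Turn each group into the statement list for one single-partition batch.
// Table and column names are placeholders.
function toBatches(groups) {
  const insert =
    'INSERT INTO my_data (partitioning_key, clustering_key, other_value) VALUES (?, ?, ?)';
  return [...groups.values()].map((rows) =>
    rows.map((r) => ({
      query: insert,
      params: [r.partitioningKey, r.clusteringKey, r.otherValue],
    }))
  );
}
```

Each inner list touches exactly one partition, so it can be sent as its own unlogged batch; with the DataStax Node.js driver that would be something like client.batch(list, { prepare: true, logged: false }).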
If you wish to go further and mimic Cassandra's hashing, look at the cluster metadata via the getMetadata method of the com.datastax.driver.core.Cluster class: it has a getTokenRanges method, whose result you can compare with the result of Murmur3Partitioner.getToken, or of whatever other partitioner you configured in cassandra.yaml. I've never tried that myself, though.
So I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than without batches, let alone batches without grouping.
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on the distribution of partition keys. If your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/#foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
The site above has sample code showing how to handle multiple async writes:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
EDIT:
please read this also:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The
coordinator doesn’t have any extra work (as for multi partition
writes) because everything goes into a single partition. Single
partition batches are optimized: they are applied with a single
RowMutation [10].
In a few words: single partition batches don’t put much more load on
the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this
very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That
batch log is replicated to two other nodes in case the coordinator
fails. If the coordinator fails then another replica for the batch log
will take over. [..] The coordinator has to do a lot more work than
any other node in the cluster.
Again, in bullets what has to be done:
serialize the batch statements
write the serialized batch to the batch log system table
replicate of this serialized batch to 2 nodes
coordinate writes to nodes holding the different partitions
on success remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions are deprecated since Cassandra 2.1.6
I'm sending a batch of inserts to a single table, where each row has a unique key and each insert carries an IF NOT EXISTS condition. The problem is that the whole batch fails to apply if even one of the rows already exists.
I need the condition to be evaluated per row, not for the batch as a whole.
Let's say I have a table "users" with a single column "user_name" that contains the row "jhon". Now I'm trying to import new users:
BEGIN BATCH
INSERT INTO "users" ("user_name") VALUES ('jhon') IF NOT EXISTS;
INSERT INTO "users" ("user_name") VALUES ('mandy') IF NOT EXISTS;
APPLY BATCH;
It will not insert "mandy" because "jhon" already exists. What can I do to isolate them?
I have a lot of rows to insert, about 100-200K, so I need to use a batch.
Thanks!
First: what you describe is documented as intended behavior:
In Cassandra 2.0.6 and later, you can batch conditional updates introduced as lightweight transactions in Cassandra 2.0. Only updates made to the same partition can be included in the batch because the underlying Paxos implementation works at the granularity of the partition. You can group updates that have conditions with those that do not, but when a single statement in a batch uses a condition, the entire batch is committed using a single Paxos proposal, as if all of the conditions contained in the batch apply.
That basically confirms it: because your statements carry a condition, only one Paxos proposal is used for the whole batch, which means the entire batch will succeed, or none of it will.
That said, with Cassandra, batches aren't meant to speed up bulk loads; they're meant to create pseudo-atomic logical operations. From http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html :
Batches are often mistakenly used in an attempt to optimize performance. Unlogged batches require the coordinator to manage inserts, which can place a heavy load on the coordinator node. If other nodes own partition keys, the coordinator node needs to deal with a network hop, resulting in inefficient delivery. Use unlogged batches when making updates to the same partition key.
The coordinator node might also need to work hard to process a logged batch while maintaining consistency between tables. For example, upon receiving a batch, the coordinator node sends batch logs to two other nodes. In the event of a coordinator failure, the other nodes retry the batch. The entire cluster is affected. Use a logged batch to synchronize tables, as shown in this example:
In your schema, each INSERT is to a different partition, which is going to add a LOT of load on your coordinator.
You can run your 200k inserts from a client with async executes, and they'll run quite fast: probably as fast as (or faster than) you'd see with a batch.
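A rough JavaScript sketch of that approach, where every row gets its own IF NOT EXISTS insert and is therefore evaluated independently. The function name and the group size are assumptions. execute is injected and must resolve to a result exposing wasApplied(), as the DataStax Node.js driver's ResultSet does; with that driver it could be (q, p) => client.execute(q, p, { prepare: true }).

```javascript
// Insert each user with its own conditional statement, a group at a time,
// and report which names were skipped because the row already existed.
async function insertNewUsers(execute, names, groupSize = 64) {
  const query = 'INSERT INTO users (user_name) VALUES (?) IF NOT EXISTS';
  const skipped = [];
  for (let i = 0; i < names.length; i += groupSize) {
    const group = names.slice(i, i + groupSize);
    // The inserts of one group run concurrently; groups run one after another.
    const results = await Promise.all(group.map((n) => execute(query, [n])));
    results.forEach((res, j) => {
      if (!res.wasApplied()) skipped.push(group[j]); // condition was not met
    });
  }
  return skipped;
}
```

With the example above, "mandy" would be inserted and only "jhon" would come back in the skipped list.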