What is the memory impact of using object-sizeof dependency - node.js

I am using the object-sizeof dependency to estimate the size of a batch.
Here is my array of batch insert queries to be executed with the Cassandra client:
var queries = [...];
var size_in_bytes = require('object-sizeof')(queries);
I would like to reject batches that could bring Cassandra down.
Otherwise, can I re-chunk those batches into smaller ones and run those queries?
What would be the better approach?

I would spread the data across several batches, especially if you are targeting more than one partition per batch. Mutations that span more than one partition have a negative impact on performance.
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useBatch.html
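
If you do go the re-chunking route, a minimal sketch follows. Note that object-sizeof measures the in-memory size of the JavaScript objects, which is only a rough proxy for the serialized mutation size Cassandra enforces via batch_size_fail_threshold_in_kb, so keep the limit conservative. The byte limit, contact point, and keyspace below are illustrative assumptions:

const sizeof = require('object-sizeof');
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],        // adjust for your cluster
  localDataCenter: 'datacenter1',
  keyspace: 'my_keyspace'              // hypothetical keyspace
});

// Illustrative limit, kept well under the server-side fail threshold.
const MAX_BATCH_BYTES = 5 * 1024;

// Split the queries array into chunks whose estimated size stays under maxBytes.
function chunkBySize(queries, maxBytes) {
  const chunks = [];
  let current = [];
  let currentBytes = 0;
  for (const q of queries) {
    const qBytes = sizeof(q);
    if (current.length > 0 && currentBytes + qBytes > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(q);
    currentBytes += qBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

async function run(queries) {
  for (const chunk of chunkBySize(queries, MAX_BATCH_BYTES)) {
    await client.batch(chunk, { prepare: true });   // each chunk is its own, smaller batch
  }
}

As the answer above notes, this only helps if each chunk targets a single partition; otherwise individual async inserts are usually the better option.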

Change the value for batch_size_fail_threshold_in_kb in the cassandra.yaml file.


Can I increase "batch size fail threshold" to 1MB in Cassandra?
# Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50
You can change this option in your cassandra.yaml to increase it, but be sure to test to make sure you're actually helping and not hurting your throughput.
Technically, you can set it to whatever size you want but it would be a bad idea.
A CQL BATCH is there to provide a means for atomic updates of a single partition across multiple tables. It is NOT an optimisation in the same way as batches are in traditional relational databases.
When you place multiple partitions in a CQL batch, the performance of that batch will be worse than issuing multiple separate write requests. We don't recommend it and it is bad practice. Cheers!

How can I make a non-atomic batch (or equivalent) statement in Cassandra

I am using the DataStax Node.js driver for Cassandra, and what I want to do is avoid the very frequent I/O operations that would happen for inserts in my application. I will be doing around 1000 inserts per second and want to group them all together and perform 1 I/O instead of running individual queries, which would cause 1000 I/Os. I came across batch statements like the one below,
const query1 = 'UPDATE user_profiles SET email = ? WHERE key = ?';
const query2 = 'INSERT INTO user_track (key, text, date) VALUES (?, ?, ?)';
const queries = [
  { query: query1, params: [emailAddress, 'hendrix'] },
  { query: query2, params: ['hendrix', 'Changed email', new Date()] }
];
client.batch(queries, { prepare: true }, function (err) {
  // All queries have been executed successfully
  // Or none of the changes have been applied, check err
});
The problem here is that they are atomic. I want the other statements to succeed even if one of them fails. Is there something that I can do to achieve that?
Batch statements across multiple partitions (which is the case with your write statements) use a LOGGED batch by default. This means that you get this atomicity property. If you really want to remove the atomicity, you should use an UNLOGGED batch. You should be aware, however, that an UNLOGGED batch across multiple partitions is an anti-pattern (https://issues.apache.org/jira/browse/CASSANDRA-9282). Let me try to explain:
When using batch statements, there are two questions, which gives 4 possible cases:
is your batch against a single partition, or multiple partitions (which is your case)?
is your batch LOGGED or UNLOGGED? LOGGED ensures atomicity (all operations succeed or none do). LOGGED batches are more costly.
Let's consider the 4 options:
single partition, LOGGED batch. You use this when you want to achieve atomicity of your writes against the single partition. This atomicity has a cost, so use it only if you need it.
single partition, UNLOGGED batch. You use this when you don't need atomicity; it is faster. If your application is correctly configured (token aware), your batch statement will choose a replica (for this partition) as coordinator, and you will get a performance boost. That's the only legitimate reason to use UNLOGGED batch. By default, a batch against the same partition is UNLOGGED.
multiple partitions, LOGGED batch. The only reason to batch queries hitting different partitions is to ensure atomicity. By default, a batch against multiple partitions is LOGGED.
multiple partitions, UNLOGGED batch. This is an anti-pattern because it brings no functional value (no atomicity) and no performance benefit (multiple partitions are involved, so the coordinator has the overhead of contacting the replicas responsible for each partition, inducing extra work).
To make it more concrete, when you issue what you call 'a single I/O' batch statement across multiple partitions, the coordinator will have to slice your 'single I/O' into 1000 individual I/Os anyway (this wouldn't be the case if all the writes were on the same partition), and coordinate that across multiple replicas.
To conclude, you might observe a performance improvement on your client side, but you will induce a much larger cost on the Cassandra side.
You might want to read the following blog post: http://batey.info/cassandra-anti-pattern-misuse-of.html
and in particular, the section concerning the use of UNLOGGED batch against multiple partitions:
What this is actually doing is putting a huge amount of pressure on a single coordinator. This is because the coordinator needs to forward each individual insert to the correct replicas. You're losing all the benefit of token aware load balancing policy as you're inserting different partitions in a single round trip to the database.
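
To make the recommendation concrete, here is a minimal sketch (reusing the two statements from the question) that issues the writes as independent async executions, so one failing statement does not affect the other. The connection settings are illustrative; if you do keep a batch, the Node.js driver also accepts logged: false to make it unlogged, with the caveats described above:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],        // adjust for your cluster
  localDataCenter: 'datacenter1',
  keyspace: 'my_keyspace'              // hypothetical keyspace
});

const query1 = 'UPDATE user_profiles SET email = ? WHERE key = ?';
const query2 = 'INSERT INTO user_track (key, text, date) VALUES (?, ?, ?)';

async function writeIndependently(emailAddress) {
  // Promise.allSettled never rejects: each statement succeeds or fails on its own.
  const results = await Promise.allSettled([
    client.execute(query1, [emailAddress, 'hendrix'], { prepare: true }),
    client.execute(query2, ['hendrix', 'Changed email', new Date()], { prepare: true })
  ]);
  // Each entry is { status: 'fulfilled' } or { status: 'rejected', reason },
  // so partial success is visible and can be retried per statement.
  return results;
}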

Cassandra: Batch write optimisation

I get a bulk write request from a client for, let's say, some 20 keys.
I can either write them to C* in one batch, or write them individually in an async way and wait on futures for them to complete.
Writing in a batch does not seem to be a good option as per the documentation, as my insertion rate will be high, and if keys belong to different partitions the coordinators will have to do extra work.
Is there a way in the DataStax Java driver with which I can group keys that belong to the same partition, club them into small batches, and then do individual unlogged batch writes asynchronously? That way I make fewer RPC calls to the server, and at the same time the coordinator will only have to write locally. I will be using the token aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so some coordination happens on the driver side.
Then, you can group your requests by equality of partition key; that would probably be enough, depending on your workload.
What I mean by 'grouping by equality of partition key' is, e.g., you have some data that looks like
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then when inserting several such objects, you group them by MyData.partitioningKey. That is, for all existing partitioningKey values, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
If you wish to go further and mimic Cassandra's hashing, then you should look at the cluster metadata via the getMetadata method of the com.datastax.driver.core.Cluster class; it has a getTokenRanges method whose results you can compare to the result of Murmur3Partitioner.getToken or whichever partitioner you configured in cassandra.yaml. I've never tried that myself though.
So, I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than no batches at all, let alone batches without grouping.
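
The question is about the Java driver, but to keep the snippets in this thread in one language, here is a rough sketch of the grouping idea with the Node.js driver; the structure is the same with the Java driver's BatchStatement. The table, columns, and field names are illustrative assumptions:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],        // adjust for your cluster
  localDataCenter: 'datacenter1',
  keyspace: 'my_keyspace'              // hypothetical keyspace
});

const INSERT = 'INSERT INTO my_table (partition_key, clustering_key, value) VALUES (?, ?, ?)';

// Group the incoming rows by partition key, one array of statements per key.
function groupByPartitionKey(rows) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row.partitionKey)) groups.set(row.partitionKey, []);
    groups.get(row.partitionKey).push({
      query: INSERT,
      params: [row.partitionKey, row.clusteringKey, row.value]
    });
  }
  return groups;
}

async function writeGrouped(rows) {
  const groups = groupByPartitionKey(rows);
  // One unlogged batch per partition, executed concurrently; with a
  // token-aware load balancing policy each batch should land on a replica
  // that owns that partition.
  const pending = [];
  for (const queries of groups.values()) {
    pending.push(client.batch(queries, { prepare: true, logged: false }));
  }
  return Promise.all(pending);
}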
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on the partition key distribution. If your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
You can find sample code on the above site showing how to handle multiple async writes:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
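
The gists above are in Java; a rough Node.js equivalent of the "individual async writes, no batch" idea is sketched below. The concurrency cap, table, and column names are illustrative assumptions, not values taken from the article:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],        // adjust for your cluster
  localDataCenter: 'datacenter1',
  keyspace: 'my_keyspace'              // hypothetical keyspace
});

const INSERT = 'INSERT INTO my_table (partition_key, clustering_key, value) VALUES (?, ?, ?)';
const MAX_IN_FLIGHT = 128;             // arbitrary illustrative cap; tune for your cluster

async function bulkInsert(rows) {
  const failed = [];
  let index = 0;

  // Each worker pulls the next row off the shared index and writes it,
  // keeping at most MAX_IN_FLIGHT requests in flight at once.
  async function worker() {
    while (index < rows.length) {
      const row = rows[index++];
      try {
        await client.execute(INSERT, [row.partitionKey, row.clusteringKey, row.value], { prepare: true });
      } catch (err) {
        failed.push({ row, err });     // collect failures for retry or inspection
      }
    }
  }

  await Promise.all(Array.from({ length: MAX_IN_FLIGHT }, worker));
  return failed;
}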
EDIT:
please read this also:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The
coordinator doesn’t have any extra work (as for multi partition
writes) because everything goes into a single partition. Single
partition batches are optimized: they are applied with a single
RowMutation [10].
In a few words: single partition batches don’t put much more load on
the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this
very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That
batch log is replicated to two other nodes in case the coordinator
fails. If the coordinator fails then another replica for the batch log
will take over. [..] The coordinator has to do a lot more work than
any other node in the cluster.
Again, in bullets what has to be done:
serialize the batch statements
write the serialized batch to the batch log system table
replicate this serialized batch to 2 nodes
coordinate writes to nodes holding the different partitions
on success remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions are deprecated since Cassandra 2.1.6

Cassandra batch size

I have 6 Cassandra nodes mainly used for writing (95%).
What's the best approach to inserting data: individual inserts or batches? Reason says batches should be used, while keeping the "batch size" under 5kb to avoid node instability:
https://issues.apache.org/jira/browse/CASSANDRA-6487
Does this 5kb concern the size of the queries, as in number of chars * bytes_per_char? Are there any performance drawbacks to only running individual inserts?
A batch will increase performance if used for a single partition. You are able to get more throughput of data inserted.
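
For illustration, a minimal sketch of such a single-partition batch with the Node.js driver: every statement shares the same partition key, so the coordinator can apply them together. The table, columns, and the 'sensor-42' key are assumptions made for the example:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],        // adjust for your cluster
  localDataCenter: 'datacenter1',
  keyspace: 'my_keyspace'              // hypothetical keyspace
});

const INSERT = 'INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)';

async function writeSinglePartitionBatch(samples) {
  // All rows target the same partition key, 'sensor-42'.
  const queries = samples.map(s => ({
    query: INSERT,
    params: ['sensor-42', s.ts, s.value]
  }));
  await client.batch(queries, { prepare: true, logged: false });
}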

datastax cassandra java driver batch delete performance behavior

If I have 500k rows to delete, should I form a batch of 100 rows for delete? i.e. 100 rows at a time?
What is the performance characteristics? Other than network round trip, would the server be benefited from the batching?
Thanks
Short answer: you're most likely better off with simple, non-batched async operations.
The batch keyword in Cassandra is not a performance optimization for batching together large buckets of data for bulk loads.
Batches are used to group together atomic operations, actions that you expect to occur together. Batches guarantee that if a single part of your batch is successful, the entire batch is successful.
Using batches will probably not make your mass ingestion or deletes run faster.
Okay but what if I use an Unlogged Batch? Will that run super fast?
Cassandra uses a mechanism called batch logging in order to ensure a batch's atomicity. By specifying an unlogged batch, you are turning off this functionality, so the batch is no longer atomic and may fail with partial completion. Naturally, there is a performance penalty for logging your batches and ensuring their atomicity; using unlogged batches removes this penalty.
There are some cases in which you may want to use unlogged batches to ensure that requests (inserts) that belong to the same partition are sent together. If you batch operations together and they need to be performed in different partitions / nodes, you are essentially creating more work for your coordinator. See specific examples of this in Ryan's blog:
Read this post
Writes and deletes are the same thing so you should expect the same performance characteristics. I would expect some slight benefits from batching but normal async operations should be just as fast.
