If I have 500k rows to delete, should I form a batch of 100 rows for delete? i.e. 100 rows at a time?
What are the performance characteristics? Other than the network round trip, would the server benefit from the batching?
Thanks
Short answer: you're most likely better off with simple, non-batched async operations.
The batch keyword in Cassandra is not a performance optimization for batching together large buckets of data for bulk loads.
Batches are used to group together atomic operations, actions that you expect to occur together. Batches guarantee that if a single part of your batch is successful, the entire batch is successful.
Using batches will probably not make your mass ingestions or deletes run faster.
Okay but what if I use an Unlogged Batch? Will that run super fast?
Cassandra uses a mechanism called batch logging in order to ensure a batch's atomicity. By specifying an unlogged batch, you are turning off this functionality, so the batch is no longer atomic and may fail with partial completion. Naturally, there is a performance penalty for logging your batches and ensuring their atomicity; using unlogged batches removes this penalty.
There are some cases in which you may want to use unlogged batches to ensure that requests (inserts) that belong to the same partition are sent together. If you batch operations together that need to be performed in different partitions/nodes, you are essentially creating more work for your coordinator. See specific examples of this in Ryan's blog:
Read this post
Writes and deletes are the same thing in Cassandra (a delete just writes a tombstone), so you should expect the same performance characteristics. I would expect some slight benefit from batching, but normal async operations should be just as fast.
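To make that concrete, here is a minimal sketch of the non-batched async approach, assuming the DataStax Java driver 3.x; the keyspace, table, and column names (ks.tbl, id) are hypothetical, and a semaphore caps how many deletes are in flight at once:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.MoreExecutors;

import java.util.List;
import java.util.UUID;
import java.util.concurrent.Semaphore;

public class AsyncDeleter {
    private static final int MAX_IN_FLIGHT = 128; // tune for your cluster

    // Deletes each key with its own async statement instead of batching.
    public static void deleteAll(Session session, List<UUID> ids) throws InterruptedException {
        PreparedStatement delete = session.prepare("DELETE FROM ks.tbl WHERE id = ?");
        Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

        for (UUID id : ids) {
            inFlight.acquire(); // back-pressure: at most MAX_IN_FLIGHT pending requests
            ResultSetFuture future = session.executeAsync(delete.bind(id));
            // Release the permit when the request completes (success or failure);
            // production code should also inspect the future for errors.
            future.addListener(inFlight::release, MoreExecutors.directExecutor());
        }
        // Wait for the remaining in-flight deletes to complete.
        inFlight.acquire(MAX_IN_FLIGHT);
    }
}

With a prepared statement and a token-aware load-balancing policy, each delete goes straight to a replica that owns the key, which is work the coordinator would otherwise have to do for a multi-partition batch.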
Since the latency in batch processing comes from accumulating a certain amount of data before processing, can I regard batch processing with a batch size of one as stream processing? Or is there some other difference in how operators do their calculations?
For example, if I set the batch size of a Spark-based program to 1, can I make its latency as low as Flink's?
My thinking is as follows:
In stream processing, a record flows from one operator to the next as soon as it has been processed, but in batch processing, new data can only be accepted once all operators have finished processing the current batch.
It seems the pipelining in stream processing accounts for the speedup.
Am I right in my explanation? If wrong, what's the appropriate explanation to my question?
TLDR: there are quite a lot of reasons why you should help your program and tell it explicitly whether you want a bounded (batch) or unbounded (stream) computation.
Your thinking is good in theory, but that's not how it works in practice: the batch vs stream setting is requested explicitly from the programmer. The runtime won't infer it from the batch size (or batch delay) you set. At least that's how Flink works.
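For instance, with Flink's DataStream API the mode is declared up front. A minimal sketch (Flink 1.12+):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The programmer picks BATCH or STREAMING explicitly; AUTOMATIC only
        // switches to BATCH when every source is bounded, never by batch size.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.fromElements(1, 2, 3).print();
        env.execute("mode-example");
    }
}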
Furthermore, the batch vs stream divide goes much deeper: batch shouldn't care much about time.
Let's say you increase the batch size to be the whole dataset size. Only then will Flink be able to apply performance-optimization passes over your plan. For example: in streaming mode, JOINs need to keep both sides in memory in case a match appears on the other side. In batch mode, Flink knows both sides are fixed-size, so it can materialize the smaller side first and keep only that in memory while it probes it with the other side. Thus Flink needs less memory in batch mode, and it uses CPU caches better (which makes for faster processing).
Also, streaming has to maintain watermarks (special row metadata that helps with correlating the right rows together time-wise, persisting a coherent set of rows together, etc.), while batch doesn't care about them. That's overhead.
If you're up for it, you can peruse the Flink source code and compare the batch vs stream SQL optimization rules. You'll see that stream has to deal with watermarks (FlinkLogicalWatermarkAssigner) while batch does not, and it has to expand temporal tables fully (LogicalCorrelateToJoinFromTemporalTableRule). Batch can also sort rows and do sort-merge joins (BatchPhysicalSortMergeJoinRule). Stream has to process aggregates incrementally (IncrementalAggregateRule) while batch can do them locally at the data source (PushLocalHashAggIntoScanRule), etc. Each difference between these two files is either a thing one side has to specifically do because of its (batch vs stream) nature, or an optimization pass that is allowed by its (batch vs stream) nature.
If you would like to know more about this topic and its numerous subtleties, you can also read the Flink blog posts, the Flink documentation, and the Flink Improvement Proposals.
I understand that Scylla allows batch statements like these.
BEGIN BATCH
<insert-stmt> / <update-stmt> / <delete-stmt>
APPLY BATCH
These statements have performance implications, as they ensure atomicity. However, I simply have many insert statements which I want to perform from my Node.js client in a single I/O. Atomicity among these inserts is not needed. Any idea how I can do that? I can't find anything.
Batching multiple inserts in the Cassandra world is usually an antipattern (except when they go into one partition; see the docs). When you're sending inserts into multiple partitions in one batch, the coordinator node will need to take care of taking the data from this batch and sending it to the nodes that own it. This puts additional load onto the coordinating node, which first needs to back up the content of the batch just so it isn't lost if the coordinator crashes in the middle of execution, and then needs to execute all operations and wait for the results of execution before sending them back to the caller (see this diagram to understand how a so-called logged batch works).
When you don't need atomicity, the best performance comes from sending multiple parallel inserts and waiting for their execution: it will be faster, it will put less load onto the nodes, and the driver can use a token-aware load-balancing policy, so requests will be sent to the nodes that own the data (if you're using prepared statements). In Node.js you can achieve this by using the Concurrent Execution API; there are several variants of its usage, so it's better to look into the documentation to select what is best for your use case.
In the doc we can find a query hint named USE_ADDITIONAL_PARALLELISM here: https://cloud.google.com/spanner/docs/query-syntax#statement-hints
However, the documentation for it is very brief.
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
In what scenario would we use it?
What is its impact on the infrastructure?
How does it scale with number of nodes?
Does it need a query that picks data from different splits, or does it work on a single split?
Any meaningful information about it is welcome.
PS: I was originally introduced to the hint in this thread
A Spanner query may be executed on multiple remote servers.
Source: An illustration of the life of a query from the Cloud Spanner "Query execution plans" documentation
The root node coordinates the query execution.
If the execution plan expects rows on multiple splits to satisfy the query predicate(s), multiple subplans are executed on the respective remote servers.
Due to the distributed nature of Spanner these subplans can sometimes be executed in parallel; for example, the right subplan execution is not dependent on the left subplan results.
If the USE_ADDITIONAL_PARALLELISM query hint is provided, the root node may choose to increase the number of parallel remote executions, if the execution plan includes multiple subplans.
To answer the original questions:
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
This hint does not change how a query is executed; it only makes it possible for subplans of that execution to be initiated with increased parallelism.
In what scenario would we use it?
Especially in cases where a full table scan is required, this may lead to faster query completion in wall-clock time, but the trade-offs concerning resource allocation, and the effects on other parallel operations, should also be considered.
What is its impact on the infrastructure?
If an increased number of remote executions are run in parallel, the average CPU for the instance may increase.
How does it scale with number of nodes?
An increased number of nodes provides additional capacity for parallel operations.
Does it need a query that picks data from different splits, or does it work on a single split?
Benefits will likely be significantly higher for queries which require data that resides on multiple splits.
A Cloud Spanner query may have multiple levels of distribution. The USE_ADDITIONAL_PARALLELISM query hint will cause a node executing a query to try to prefetch the results of subqueries further up in the distribution queue. This can be useful in scenarios such as queries doing full table scans, or full table scans with aggregations like COUNT(), MAX, MIN, etc., where identical subqueries can be distributed to many splits and where the individual subqueries to the splits return relatively little data (such as aggregation state). However, if the individual subqueries return significant data, then using this hint can cause memory usage on the consuming node to go up significantly due to prefetching.
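For reference, the hint is written as a prefix on the query text. A minimal sketch with the Cloud Spanner Java client, where the project/instance/database IDs and the Albums table are hypothetical:

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Statement;

public class ParallelismHintExample {
    public static void main(String[] args) {
        Spanner spanner = SpannerOptions.newBuilder().build().getService();
        try {
            DatabaseClient db = spanner.getDatabaseClient(
                DatabaseId.of("my-project", "my-instance", "my-database"));
            // A full-scan aggregation: the kind of query where the hint may help.
            Statement stmt = Statement.of(
                "@{USE_ADDITIONAL_PARALLELISM=TRUE} SELECT COUNT(*) FROM Albums");
            try (ResultSet rs = db.singleUse().executeQuery(stmt)) {
                while (rs.next()) {
                    System.out.println("count = " + rs.getLong(0));
                }
            }
        } finally {
            spanner.close();
        }
    }
}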
I get a bulk write request for, let's say, some 20 keys from a client.
I can either write them to C* in one batch, or write them individually asynchronously and wait on the futures until they are completed.
Writing in a batch does not seem to be a good option as per the documentation, as my insertion rate will be high and, if the keys belong to different partitions, the coordinators will have to do extra work.
Is there a way in the DataStax Java driver with which I can group keys that could belong to the same partition, club them into small batches, and then do individual unlogged batch writes asynchronously? That way I make fewer RPC calls to the server, and at the same time the coordinator will only have to write locally. I will be using a token-aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so some coordination happens on the driver side.
Then, you can group your requests by equality of partition key; that will probably be enough, depending on your workload.
What I mean by 'grouping by equality of partition key' is, e.g., you have some data that looks like
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then, when inserting several such objects, you group them by MyData.partitioningKey. That is, for all existing partitioningKey values, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them, as in the sketch below.
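A minimal sketch of that grouping with the DataStax Java driver 3.x; the keyspace, table, and column names are hypothetical, and MyData mirrors the shape above:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MyData {
    String partitioningKey;
    String clusteringKey;
    String otherValue;
    String andAnotherOne;
}

class GroupedBatchWriter {
    // Groups rows by partition key and sends one UNLOGGED batch per partition.
    static List<ResultSetFuture> writeGrouped(Session session, List<MyData> rows) {
        PreparedStatement insert = session.prepare(
            "INSERT INTO ks.my_data (pk, ck, other_value, and_another_one) VALUES (?, ?, ?, ?)");

        Map<String, List<MyData>> byPartition =
            rows.stream().collect(Collectors.groupingBy(r -> r.partitioningKey));

        List<ResultSetFuture> futures = new ArrayList<>();
        for (List<MyData> group : byPartition.values()) {
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (MyData r : group) {
                batch.add(insert.bind(r.partitioningKey, r.clusteringKey, r.otherValue, r.andAnotherOne));
            }
            // With TokenAwarePolicy, each single-partition batch is routed to a replica
            // that owns the partition, so the coordinator writes locally.
            futures.add(session.executeAsync(batch));
        }
        return futures;
    }
}

The caller then waits on the returned futures to know when all the small batches have completed.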
If you wish to go further and mimic Cassandra's hashing, then you should look at the cluster metadata via the getMetadata method in the com.datastax.driver.core.Cluster class: there is a getTokenRanges method whose results you can compare to the result of Murmur3Partitioner.getToken, or of any other partitioner you configured in cassandra.yaml. I've never tried that myself, though.
So, I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than without batches, let alone batches without grouping.
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on the partition key distribution: if your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
You can find sample code on the above site showing how to handle multiple async writes:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
EDIT:
please read this also:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The coordinator doesn’t have any extra work (as for multi partition writes) because everything goes into a single partition. Single partition batches are optimized: they are applied with a single RowMutation [10].
In a few words: single partition batches don’t put much more load on the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That batch log is replicated to two other nodes in case the coordinator fails. If the coordinator fails then another replica for the batch log will take over. [..] The coordinator has to do a lot more work than any other node in the cluster.
Again, in bullets, what has to be done:
- serialize the batch statements
- write the serialized batch to the batch log system table
- replicate this serialized batch to 2 nodes
- coordinate writes to the nodes holding the different partitions
- on success, remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions have been deprecated since Cassandra 2.1.6.
I understand the basic difference between LOGGED and UNLOGGED batches in Cassandra in terms of atomicity. Essentially, LOGGED batches are atomic while UNLOGGED are not. This means that all statements in a LOGGED batch get executed (or not executed) all together.
In the case of an UNLOGGED batch, if something goes wrong during the write operation of a constituent statement, I know that the already-executed statements are NOT rolled back, but does Cassandra notify the driver of the failure of the whole batch?
So logged batches use a log to record the batch operation and then execute it, removing it from the log when it succeeds. An unlogged batch is still a batch operation, but without the overhead of the log. In small amounts, logged batches are OK, but as you scale up, this batch log can grow and become a problem point. The DataStax docs actually cover batching, with some examples:
https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useBatch.html
Example of good batches
Example of bad batches
Generally speaking, batches have their uses, but I have seen them cause performance problems when overused, because of the penalty you pay for grouping them up on a coordinator node. I often point folks to this well-known blog that outlines some useful info on batches too.
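On the earlier question of failure reporting: the coordinator does report errors for the batch request as a whole. With the Java driver 3.x, for example, a timed-out logged batch surfaces as a WriteTimeoutException whose write type indicates which phase failed. A hedged sketch (the ks.t table and its columns are hypothetical):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

class BatchErrorHandling {
    // Executes a logged batch and inspects which phase timed out, if any.
    static void runBatch(Session session) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(new SimpleStatement("INSERT INTO ks.t (pk, v) VALUES ('a', 1)"));
        batch.add(new SimpleStatement("INSERT INTO ks.t (pk, v) VALUES ('b', 2)"));
        try {
            session.execute(batch);
        } catch (WriteTimeoutException e) {
            if (e.getWriteType() == WriteType.BATCH_LOG) {
                // Timed out while writing the batch log: the batch was not applied.
                System.err.println("Batch log write timed out; safe to retry: " + e);
            } else if (e.getWriteType() == WriteType.BATCH) {
                // The batch log was written, so Cassandra will replay the batch eventually.
                System.err.println("Batch partially applied; will be completed by replay: " + e);
            }
        }
    }
}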