Batch size of prepared statement in Spring Data Cassandra

I'm getting this warning in the log:
WARN [Native-Transport-Requests:17058] 2014-07-29 13:58:33,776 BatchStatement.java (line 223) Batch of prepared statements for [keyspace.tablex] is of size 10924, exceeding specified threshold of 5120 by 5804.
Is there a way in spring data cassandra to specify the size?
Cassandra 2.0.9 and spring data cassandra 1.0.0-RELEASE

This is just a warning, informing you that the query size exceeds a certain limit.
The query is still processed. The reasoning is that large batched queries are expensive and can cause cluster imbalance, so Cassandra warns you (the developer) beforehand.
Look for batch_size_warn_threshold_in_kb in cassandra.yaml to adjust when this warning should be produced.
Here is the ticket where it was introduced: https://issues.apache.org/jira/browse/CASSANDRA-6487
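For reference, the setting looks roughly like this in cassandra.yaml (5 KB is the default; the exact comment wording varies between Cassandra versions):
# Log WARN on any batch exceeding this size. 5kb per batch by default.
batch_size_warn_threshold_in_kb: 5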

I have done extensive performance testing and tuning on Cassandra, working closely with DataStax Support.
That is why I created the ingest() methods in SDC*, which are super fast in 1.0.4.RELEASE and higher.
This method caches the PreparedStatement for you, then loops over the individual bind values and calls executeAsync for each insert. This sounds counterintuitive, but it is the fastest (and most balanced) way to insert into Cassandra.
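The same pattern can be written directly against the DataStax Java driver. Here is a minimal sketch, assuming an existing Session plus made-up keyspace, table, columns and value holder (MyValue): prepare once, then bind and executeAsync per row.
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import java.util.ArrayList;
import java.util.List;

// Prepare once and cache the PreparedStatement, then fire one async insert per bind value.
PreparedStatement ps = session.prepare(
        "INSERT INTO my_keyspace.tablex (id, value) VALUES (?, ?)");   // hypothetical table and columns
List<ResultSetFuture> futures = new ArrayList<>();
for (MyValue v : values) {                       // 'values' holds your bind values (hypothetical type)
    futures.add(session.executeAsync(ps.bind(v.getId(), v.getPayload())));
}
for (ResultSetFuture f : futures) {
    f.getUninterruptibly();                      // wait until every insert has completed
}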

Related

Could my large amount of tables (2k+) be causing my write timeout exceptions?

I'm running OS Cassandra 3.11.9 with DataStax Java Driver 3.8.0. I have a Cassandra keyspace that has multiple tables functioning as lookup tables / search indices. Whenever I receive a new POST request to my endpoint, I parse the object and insert it into the corresponding Cassandra table. I also issue inserts into each corresponding lookup table (10-20 per object).
When ingesting a lot of data into the system, I've been running into WriteTimeoutExceptions in the driver.
I tried to serialize the insert requests into the lookup tables by introducing Apache Camel and putting all the Statements into a queue that the Session could work off of, but it did not help.
With Camel, since the exceptions are now happening in the Camel thread, the test continues to run, instead of failing on the first exception. Eventually, the test seems to crash Cassandra. (Nothing in the Cassandra logs though)
I also tried to turn off my lookup tables and instead insert into the main table 15x per object (to simulate a similar number of writes as if I had the lookup tables on). This test passed with no exception, which makes me think the large number of tables is the problem.
Is a large number (2k+) of Cassandra tables a code smell? Should we rearchitect, or just throw more resources at it? Nothing indicative has shown up in the logs, mostly just some status about the number of tables etc. (no exceptions).
Can the DataStax Java Driver be used multithreaded like this? It says it is thread-safe.
There is a direct effect of a high number of tables on performance - see this doc (the whole series is a good source of information) and this blog post for more details. Basically, with ~1000 tables you get roughly 20-25% performance degradation.
That could be a reason; not completely direct, but related. For each table, Cassandra needs to allocate memory, reserve part of the memtable space for it, keep metadata about it, etc. This specific problem could come from blocked memtable flushes, or something similar. Check nodetool tpstats and nodetool tablestats for blocked or pending memtable flushes. It's better to set up a continuous monitoring solution, such as the Metrics Collector for Apache Cassandra, and watch the important metrics (which include that information) over a period of time.
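As a starting point, the two nodetool checks mentioned above look like this on each node (pool names as in Cassandra 3.11):
nodetool tpstats      # look at the MemtableFlushWriter pool for non-zero Pending / Blocked counts
nodetool tablestats   # per-table statistics, including memtable sizes and switch counts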

Can I increase batch_size_fail_threshold to 1MB in Cassandra?

Can I increase "batch size fail threshold" to 1MB in Cassandra?
# Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50
You can change the batch_size_fail_threshold_in_kb option in your cassandra.yaml to increase it, but be sure to test to make sure you're actually helping and not hurting your throughput.
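For example, since the threshold is expressed in KB, 1 MB corresponds to 1024; this only illustrates the setting, not a recommendation:
batch_size_fail_threshold_in_kb: 1024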
Technically, you can set it to whatever size you want but it would be a bad idea.
A CQL BATCH is there to provide a means for atomic updates of a single partition across multiple tables. It is NOT an optimisation in the same way as batches are in traditional relational databases.
When you place multiple partitions in a CQL batch, the performance of that batch will be worse than issuing multiple separate write requests. We don't recommend it and it is bad practice. Cheers!
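To make the distinction concrete, here is a sketch with the DataStax Java driver, assuming an existing Session and made-up tables, columns and a Trade class: a logged batch is appropriate when the statements belong to one logical entity (the same partition key value across its query tables), while writes for unrelated partitions are better issued as individual async statements.
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;

// Legitimate use: keep the denormalised tables for ONE trade consistent.
PreparedStatement insertTrade   = session.prepare(
        "INSERT INTO ks.trade (trade_id, details) VALUES (?, ?)");            // hypothetical tables
PreparedStatement insertByParty = session.prepare(
        "INSERT INTO ks.trade_by_party (trade_id, party) VALUES (?, ?)");

BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
batch.add(insertTrade.bind(tradeId, details));
batch.add(insertByParty.bind(tradeId, party));
session.execute(batch);                          // one small batch per logical entity

// Anti-pattern: piling many unrelated partitions into one big batch.
// Prefer separate asynchronous writes instead:
for (Trade t : trades) {                         // 'Trade' and 'trades' are hypothetical
    session.executeAsync(insertTrade.bind(t.getId(), t.getDetails()));
}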

Cassandra vs Cassandra+Ignite

(Single-node cluster) I've got a table with 2 columns, one of type 'text' and the other a 'blob'. I'm using DataStax's C++ driver to perform read/write requests in Cassandra.
The blob stores a C++ structure (size: 7 KB).
Since I was getting less than desirable throughput when using Cassandra alone, I tried adding Ignite on top of Cassandra, in the hope that performance would improve significantly now that the data would be read from RAM instead of hard disks.
However, it turned out that after adding Ignite, the performance actually dropped (by roughly 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since I am storing a C++ structure in Cassandra's blob, the Ignite API uses serialization/deserialization while writing/reading the data. Is this the reason for the drop in performance (considering the size of the structure, i.e. 7 KB), or is this drop not expected at all and maybe something is wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
The Ignite configuration is the same as given here.
I got a significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode. The throughput has now increased from 9000 rows/second to 23000 rows/second. Still, it's not significantly better than Cassandra alone. I'm hopeful of finding more tweaks that will improve this further.
I've added some more details about the configurations and client code on github.
It looks like you do one get per key in this benchmark for Ignite, and you didn't invoke loadCache before it. In that case, on each get, Ignite will go to Cassandra to fetch the value and only afterwards store it in the cache. So I'd recommend invoking loadCache before benchmarking, or at least testing gets on the same keys, to give Ignite a chance to keep the keys in the cache. If you think you already have all the data in the caches, please also share the code where you write data to Ignite.
Also, you invoke "grid.GetCache" in each thread - it won't take a lot of time, but you should definitely avoid such things inside the benchmark, while you are already measuring time.
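For what it's worth, the warm-up described above looks roughly like this in Ignite's Java API (the question uses the C++ API, and the configuration file and cache name here are made up, so treat this purely as a sketch of the idea):
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

Ignite ignite = Ignition.start("ignite-cassandra-config.xml");   // hypothetical config file
IgniteCache<String, byte[]> cache = ignite.cache("blobCache");   // obtain the cache ONCE, outside the timed loop

// Warm the cache from the underlying Cassandra store before measuring reads,
// so that gets are served from RAM instead of each one falling through to Cassandra.
cache.loadCache(null);

byte[] value = cache.get("some-key");                            // measured gets now hit the in-memory cache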

Batch insert overflow

I am using Cassandra 3.10 and am trying to follow the best practice of having a table per query, so I am using the batch insert principle to insert into multiple tables as a single transaction. However, I get the following error in the Cassandra log.
Batch for [zed.payment, zed.trade_party_b_ref, zed.trade_product_type, zed.trade, zed.fx_variance_swap, zed.trade_party_a_ref, zed.trade_party_b_trade_id, zed.market_value] is of size 5.926KiB, exceeding specified threshold of 5.000KiB by 0.926KiB.
The log is saying that you are sending a batch of almost 6MB when the limit is 5MB.
You should send smaller batches of data to avoid going over that batch size limit.
You can also change the batch size limit in cassandra.yaml, but I would not recommend changing it.
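If you cannot reduce the number of writes, one approach (sketched here with the DataStax Java driver; 'statements' stands for whatever bound inserts you are batching today, and the chunk size is illustrative) is to cap how many statements go into each batch and flush in chunks, at the cost of losing batch-wide atomicity:
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.BoundStatement;

// Flush every few statements so each batch stays well under the warn threshold.
// Ideally each chunk should still only touch a single partition.
int maxPerBatch = 10;                                        // illustrative value, tune for your row sizes
BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
for (BoundStatement stmt : statements) {                     // 'statements' = your prepared, bound inserts
    batch.add(stmt);
    if (batch.size() >= maxPerBatch) {
        session.execute(batch);
        batch.clear();
    }
}
if (batch.size() > 0) {
    session.execute(batch);                                  // flush the remainder
}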
Thanks for the info, the parameter in cassandra.yaml is
Log WARN on any multiple-partition batch size exceeding this value. 5kb per batch by default.
Caution should be taken on increasing the size of this threshold as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
which is in KB, not MB, so my batch statement is really 6KB, not 6MB.
After 30 years working with Oracle, this is my first venture into Cassandra, so I have tried to follow the guideline of having a separate table for each query: where I have a financial trade table which has to be queried in up to 8 different ways, I have 8 tables. That then implies that an insert into the tables must be done in a batch to create what would be a single transaction in Oracle. The master table of the eight has a significant number of sibling tables which must also be included in the batch, so here is my point:
If Cassandra does not support transactions but relies on the batch functionality to achieve the same effect, it must not impose a limit on the size of the batch. If this is not possible, then Cassandra is really limited to applications with VERY simple data structures.

What does 'Invalid frame size' mean in Thrift

Running Cassandra 2.1 clusters here, where we see a few errors like the one below from time to time:
ERROR [Thrift-Selector_15] 2017-07-15 01:08:42,677 Message.java:164 - Invalid frame size got (15826670), maximum expected 15728640
Wondering what might be the cause of these, and what their impact on the cluster is?
Essentially, this is telling you that the data size of your upsert is too big. You have a couple of options here:
1. Modify your application logic to write data in smaller amounts.
2. Increase the thrift_framed_transport_size_in_mb setting in cassandra.yaml to something that better accommodates your write pattern.
3. Change your application to use the native binary protocol, which has a higher default frame size (256MB).
I recommend #3 for the long term. For the short term you could experiment with #2. But Thrift has been deprecated, is disabled by default in current versions of Cassandra, and will be removed altogether in the near future.
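For reference, both frame-size settings live in cassandra.yaml; the values below are the defaults as of Cassandra 2.1 (15 MB matches the 15728640-byte maximum in the error above), so adjust with care:
thrift_framed_transport_size_in_mb: 15
native_transport_max_frame_size_in_mb: 256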
