What does 'Invalid frame size' mean in Thrift - cassandra

We run Cassandra 2.1 clusters and see a few errors like the one below from time to time:
ERROR [Thrift-Selector_15] 2017-07-15 01:08:42,677 Message.java:164 - Invalid frame size got (15826670), maximum expected 15728640
What might cause these errors, and what is their impact on the cluster?

Essentially, this is telling you that the data size of your upsert is too big: the Thrift frame was 15826670 bytes, against a maximum of 15728640 bytes (15 MB). You have a few options here:
1. Modify your application logic to write data in smaller amounts.
2. Increase the thrift_framed_transport_size_in_mb setting in cassandra.yaml to something that better accommodates your write pattern.
3. Change your application to use the native binary protocol, which has a higher default frame size (256 MB).
I recommend #3 for the long term. For the short term you could experiment with #2. But Thrift has been deprecated and is disabled by default in current versions of Cassandra, and will be removed altogether in the near future.
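As a rough check of the numbers in the log line, the sketch below converts the frame limit from megabytes to bytes and estimates the smallest thrift_framed_transport_size_in_mb that would have accepted the rejected frame. It assumes the setting is counted in units of 1024 * 1024 bytes, which matches the 15728640 limit reported in the error; the function names are illustrative only.

```python
import math

BYTES_PER_MB = 1024 * 1024  # assumption: the setting is counted in MiB


def frame_limit_bytes(size_in_mb: int) -> int:
    """Frame limit in bytes for a given thrift_framed_transport_size_in_mb."""
    return size_in_mb * BYTES_PER_MB


def min_frame_size_mb(frame_bytes: int) -> int:
    """Smallest whole-MB setting that would accept a frame of this size."""
    return math.ceil(frame_bytes / BYTES_PER_MB)


observed = 15826670                  # size reported in the error
limit = frame_limit_bytes(15)        # 15728640, the maximum in the error
print(observed - limit)              # bytes over the limit: 98030
print(min_frame_size_mb(observed))   # 16
```

The frame was only about 96 KB over the limit, so a small increase would silence this particular error, but larger writes would hit the new limit again.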

Related

Cassandra vs Cassandra+Ignite

(Single-node cluster) I have a table with two columns: one is of type 'text' and the other is a 'blob'. I'm using DataStax's C++ driver to perform read/write requests in Cassandra.
The blob stores a C++ structure (size: 7 KB).
Since I was getting lower throughput than I wanted when using Cassandra alone, I tried adding Ignite on top of Cassandra, hoping for a significant performance improvement because the data would now be read from RAM instead of from hard disks.
However, after adding Ignite, performance dropped even further (by roughly 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since I am storing a C++ structure in Cassandra's blob, the Ignite API serializes/deserializes the data when writing/reading it. Is this the reason for the drop in performance (given the size of the structure, i.e. 7 KB), or is this drop not expected at all, and maybe something is wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
Configurations for Ignite are the same as given here.
I got a significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode. The throughput has now increased from 9000 rows/second to 23000 rows/second. Still, that is not significantly better than Cassandra alone. I'm hoping to find more tweaks that will improve this further.
I've added some more details about the configuration and client code on GitHub.
It looks like you do one get per key in this benchmark for Ignite, and you didn't invoke loadCache before it. In this case, on each get Ignite will go to Cassandra to fetch the value and only then store it in the cache. So I'd recommend invoking loadCache before benchmarking or, at least, testing gets on the same keys, to give Ignite an opportunity to store the keys in the cache. If you think you already have all the data in the caches, please also share the code where you write data to Ignite.
Also, you invoke "grid.GetCache" in each thread. It won't take a lot of time, but you should definitely avoid such calls inside a benchmark while you are already measuring time.
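To illustrate why a cold cache makes the benchmark measure Cassandra rather than Ignite, here is a toy read-through cache. It is not the Ignite API: the class, its methods, and the dictionary standing in for Cassandra are all hypothetical, and it only counts how many reads reach the backing store.

```python
class ReadThroughCache:
    """Toy read-through cache; a stand-in, not the Ignite API."""

    def __init__(self, store):
        self.store = store      # dict standing in for Cassandra
        self.cache = {}         # stands in for the in-memory cache
        self.store_reads = 0    # reads that had to go to the store

    def _read_store(self, key):
        self.store_reads += 1
        return self.store[key]

    def get(self, key):
        # On a miss, fetch from the store and only then cache the value.
        if key not in self.cache:
            self.cache[key] = self._read_store(key)
        return self.cache[key]

    def load_cache(self):
        # Warm the cache up front, before any timing starts.
        for key in self.store:
            self.cache[key] = self._read_store(key)


def store_reads_during_benchmark(preload: bool) -> int:
    """Count store reads that happen inside the measured loop."""
    store = {f"key{i}": f"value{i}" for i in range(1000)}
    cache = ReadThroughCache(store)
    if preload:
        cache.load_cache()
    before = cache.store_reads
    for key in store:           # one get per distinct key
        cache.get(key)
    return cache.store_reads - before


print(store_reads_during_benchmark(preload=False))  # 1000
print(store_reads_during_benchmark(preload=True))   # 0
```

With one get per distinct key and no preload, every measured read goes to the store; after a preload, none do.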

Cassandra gossipinfo severity explained

I was unable to find good documentation or a detailed explanation of what severity indicates in nodetool gossipinfo.
The severity is a value added to the latency in the dynamic snitch to determine which replica a coordinator will send a read's DATA and DIGEST requests to.
Its value depends on the IO used in compaction, and it also tries to read /proc/stat (the same source as the iostat utility) to get actual disk statistics as its weight. In versions of Cassandra after 3.10 this was removed in https://issues.apache.org/jira/browse/CASSANDRA-11738. In previous versions you can disable it by setting -Dcassandra.ignore_dynamic_snitch_severity in the JVM options. The issue is that it weights IO use the same as latency. So if a node is GC thrashing and not doing much IO because of it, it could end up being treated as the target of most reads even though it's the worst possible node to send requests to.
You can still use JMX to set the value (to 1) if you want to exclude a node from being used for reads. An example use case is running nodetool disablebinary so applications won't query the node directly, then setting the severity to 1. That node would then only be queried by the cluster if there's a CL.ALL request or a read repair. It's a way to take a node "offline" for maintenance from a read perspective while still allowing it to receive mutations so it doesn't fall behind.
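A toy model of "severity added to latency" may help. This is not Cassandra's actual code: it assumes each replica's latency is normalised against the slowest replica and the severity is simply added to that score, with the lowest score ranked first. All names and numbers are made up for illustration.

```python
def score(latency_ms: float, max_latency_ms: float, severity: float) -> float:
    """Toy score: normalised latency plus severity (lower is better)."""
    return latency_ms / max_latency_ms + severity


def rank_replicas(stats: dict) -> list:
    """stats maps node name -> (latency_ms, severity); best replica first."""
    max_latency = max(latency for latency, _ in stats.values())
    return sorted(
        stats,
        key=lambda node: score(stats[node][0], max_latency, stats[node][1]),
    )


# Hypothetical replicas: "b" is the fastest and would normally be preferred.
stats = {"a": (2.0, 0.0), "b": (1.0, 0.0), "c": (1.5, 0.0)}
print(rank_replicas(stats))   # ['b', 'c', 'a']

# Setting the severity of "b" to 1 pushes it behind every other replica.
stats["b"] = (1.0, 1.0)
print(rank_replicas(stats))   # ['c', 'a', 'b']
```

In this model a severity of 1 outweighs any normalised latency difference, which is why setting it to 1 effectively removes the node from read routing.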
Severity reports activity that happens on a particular node (compaction, etc.), and this information is then used to decide which node can better handle the request. There is a discussion in the original JIRA about this functionality and how this information is used.
P.S. Please see Chris's answer about the changes in post-3.10 versions. I wasn't aware of these changes...

Cassandra High client read request latency compared to local read latency

We have a 20-node Cassandra cluster serving a lot of read requests (~900k/sec at peak). Our dataset is fairly small, so everything is served directly from memory (OS page cache). Our data model is quite simple (just a key/value) and all of our reads are performed with consistency level ONE (RF 3).
We use the DataStax Java driver with TokenAwarePolicy, so all of our reads should go directly to one node that has the requested data.
These are some metrics extracted from one of the nodes regarding client read request latency and local read latency.
org_apache_cassandra_metrics_ClientRequest_50thPercentile{scope="Read",name="Latency",} 105.778
org_apache_cassandra_metrics_ClientRequest_95thPercentile{scope="Read",name="Latency",} 1131.752
org_apache_cassandra_metrics_ClientRequest_99thPercentile{scope="Read",name="Latency",} 3379.391
org_apache_cassandra_metrics_ClientRequest_999thPercentile{scope="Read",name="Latency",} 25109.16
org_apache_cassandra_metrics_Keyspace_50thPercentile{keyspace="<keyspace>",name="ReadLatency",} 61.214
org_apache_cassandra_metrics_Keyspace_95thPercentile{keyspace="<keyspace>",name="ReadLatency",} 126.934
org_apache_cassandra_metrics_Keyspace_99thPercentile{keyspace="<keyspace>",name="ReadLatency",} 182.785
org_apache_cassandra_metrics_Keyspace_999thPercentile{keyspace="<keyspace>",name="ReadLatency",} 454.826
org_apache_cassandra_metrics_Table_50thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 105.778
org_apache_cassandra_metrics_Table_95thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 1131.752
org_apache_cassandra_metrics_Table_99thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 3379.391
org_apache_cassandra_metrics_Table_999thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 25109.16
Another important detail is that most of our queries (~70%) don't return anything, i.e., they are for records not found. So, bloom filters play an important role here and they seem to be fine:
Bloom filter false positives: 27574
Bloom filter false ratio: 0.00000
Bloom filter space used:
Bloom filter off heap memory used: 6760992
As can be seen, the reads on each node are really fast: the 99.9th percentile is less than 0.5 ms. However, the client request latency is much higher, going above 4 ms at the 99th percentile. If I'm reading with CL ONE and using TokenAwarePolicy, shouldn't both values be similar, since no coordination is required? Am I missing something? Is there anything else I could check to see what's going on?
Thanks in advance.
@luciano
There are various reasons why the coordinator and the replica can report different 99th percentiles for read latencies, even with token awareness configured in the client.
These can be anything that manifests between the coordinator code and the replica's storage engine code in the read path.
Examples can be:
read repairs (not directly related to a particular request, as they are asynchronous to the read that triggered them, but they can cause issues),
host timeouts (and/or speculative retries),
token awareness failure (dynamic snitch simply not keeping up),
GC pauses.
Look for metrics anomalies per host and overlaps with GC, and try to capture traces for some of the slower requests to check whether they're doing everything you expect from C* (e.g. token awareness).
Even well-tuned and well-spec'd clusters may see the dynamic snitch simply not being able to keep up and do its intended job. In such situations, disabling the dynamic snitch can fix the high latencies for top-end read percentiles. See https://issues.apache.org/jira/browse/CASSANDRA-6908
Be careful though: measure and confirm hypotheses, as misapplied solutions easily have negative effects!
Even with TokenAwarePolicy, the driver can't apply the policy when it doesn't know what the partition key is.
If you are using simple statements, no routing information is provided, so you need to give the driver additional information by calling setRoutingKey.
The DataStax Java Driver's manual is a good friend.
http://docs.datastax.com/en/developer/java-driver/3.1/manual/load_balancing/#requirements
If token awareness is working perfectly, the CoordinatorReadLatency value is mostly the same as ReadLatency. You should check that too.
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=coordinatorreadlatency
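The effect of missing routing information can be pictured with a small toy model. It is not the DataStax driver's implementation: the class, the replica map, and the round-robin fallback are assumptions made for illustration. When the routing key is known, the request goes to a replica that owns the data; when it is not, any node may become the coordinator and has to forward the read.

```python
import itertools
from typing import Dict, List, Optional


class TokenAwareSketch:
    """Toy coordinator selection; a stand-in, not the driver's policy."""

    def __init__(self, nodes: List[str], replicas_for_key: Dict[str, List[str]]):
        self.replicas_for_key = replicas_for_key     # hypothetical key -> replicas
        self._round_robin = itertools.cycle(nodes)   # fallback child policy

    def pick_coordinator(self, routing_key: Optional[str]) -> str:
        # With a known routing key, go straight to a replica for that key.
        if routing_key is not None and routing_key in self.replicas_for_key:
            return self.replicas_for_key[routing_key][0]
        # Without one, fall back to round-robin over all nodes.
        return next(self._round_robin)


policy = TokenAwareSketch(["n1", "n2", "n3"], {"k1": ["n3", "n1"]})
print(policy.pick_coordinator("k1"))   # n3, a replica for k1
print(policy.pick_coordinator(None))   # n1, whichever node is next in rotation
```

In the fallback case the coordinator is often not a replica, which adds a network hop and shows up as a gap between coordinator and local read latency.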
Thanks for your reply, and sorry about the delay in getting back to you.
One thing I’ve found out is that our clusters had:
dynamic_snitch_badness_threshold=0
in the config files. Changing that to the default value (0.1) helped a lot in terms of the client request latency.
The GC seems to be stable, even under high load. The pauses are constant (~10ms / sec) and I haven’t seen spikes (not even full GCs). We’re using CMS with a bigger Xmn (2.5GB).
Read repairs happen all the time (we have the chance set to 10%), so when the system is handling 800k rec/sec, we have ~80k read repairs/sec happening in the background.
It also seems that we’re asking too much of the 20-machine cluster. From the client's point of view, latency is quite stable until 800k qps; after that it starts to spike a little, but it stays under a manageable threshold.
Thanks for all the tips, the dynamic snitch thing was really helpful!

batch size of prepared statement in spring data cassandra

I'm getting this warning in the log:
WARN [Native-Transport-Requests:17058] 2014-07-29 13:58:33,776 BatchStatement.java (line 223) Batch of prepared statements for [keyspace.tablex] is of size 10924, exceeding specified threshold of 5120 by 5804.
Is there a way in spring data cassandra to specify the size?
Cassandra 2.0.9 and spring data cassandra 1.0.0-RELEASE
This is just a warning, informing you that the query size exceeds a certain limit.
The query is still processed. The reasoning is that bigger batched queries are expensive and may cause cluster imbalance, so the warning alerts you (the developer) beforehand.
Look for batch_size_warn_threshold_in_kb in cassandra.yaml to adjust when this warning is produced.
Here is the ticket where it was introduced: https://issues.apache.org/jira/browse/CASSANDRA-6487
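The numbers in the warning can be reproduced with a little arithmetic. The sketch below assumes the threshold is configured in KB of 1024 bytes (5 KB gives the 5120 in the log) and that the warning fires when the batch size exceeds it; the function names are illustrative.

```python
import math


def batch_excess_bytes(batch_bytes: int, warn_threshold_kb: int) -> int:
    """Bytes by which a batch exceeds the warn threshold (0 if it does not)."""
    threshold_bytes = warn_threshold_kb * 1024
    return max(0, batch_bytes - threshold_bytes)


def min_quiet_threshold_kb(batch_bytes: int) -> int:
    """Smallest whole-KB threshold that a batch of this size stays within."""
    return math.ceil(batch_bytes / 1024)


print(batch_excess_bytes(10924, 5))    # 5804, as reported in the log
print(min_quiet_threshold_kb(10924))   # 11
```

Raising the threshold only hides the warning; the batch is just as expensive for the cluster as before.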
I have done extensive performance testing and tuning on Cassandra, working closely with DataStax Support.
That is why I created the ingest() methods in SDC*, which are super fast in 1.0.4.RELEASE and higher.
This method caches the PreparedStatement for you, then loops over the individual bind values and calls executeAsync for each insert. This sounds counterintuitive, but it is the fastest (and most balanced) way to insert into Cassandra.
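The pattern described here (prepare once, then one asynchronous execute per row) can be sketched as follows. This is not the Spring Data Cassandra ingest() implementation: StubSession is a hypothetical in-memory stand-in for a driver session, and its prepare and execute_async methods are assumptions made so the example runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class StubSession:
    """Hypothetical stand-in for a driver session; stores rows in memory."""

    def __init__(self):
        self.prepare_calls = 0
        self.rows = []
        self._pool = ThreadPoolExecutor(max_workers=4)

    def prepare(self, cql):
        self.prepare_calls += 1
        return cql                       # a real driver returns a statement object

    def execute_async(self, prepared, values):
        # Submit the "insert" and return a future immediately.
        return self._pool.submit(self.rows.append, tuple(values))

    def close(self):
        self._pool.shutdown(wait=True)


def ingest(session, cql, rows):
    """Prepare the statement once, then issue one async insert per row."""
    prepared = session.prepare(cql)
    futures = [session.execute_async(prepared, row) for row in rows]
    for future in futures:
        future.result()                  # wait for completion, surface errors
    return len(futures)


session = StubSession()
count = ingest(session, "INSERT INTO t (k, v) VALUES (?, ?)",
               [(i, f"v{i}") for i in range(100)])
session.close()
print(count, session.prepare_calls)      # 100 1
```

Each insert is an independent request, so a token-aware driver can send it to a replica for its own partition instead of funnelling the whole batch through one coordinator.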

Cassandra: choosing sstable_size_in_mb

I have Cassandra 1.1.9 with a large column family, 1.5 TB in size per node. It is configured with LeveledCompaction.
What value of sstable_size_in_mb should I choose? Currently we use 100 MB, which results in ~20,000 files per node. What issues should I keep in mind when choosing it?
Larger is probably better. Some work was recently done on finding a better size than the small 5 MB default. The new default size is 160 MB. You can read about it in CASSANDRA-5727.
If you are using LCS you should consider upgrading to 1.2.x sometime for a lot of improvements there.
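For a rough sense of how sstable_size_in_mb affects the number of SSTables, the following estimate divides the data size per node by the target SSTable size. It assumes 1 TB = 1024 * 1024 MB and that every SSTable reaches the target size, which is a simplification; the function name is illustrative.

```python
import math

MB_PER_TB = 1024 * 1024  # assumption: binary units


def estimated_sstable_count(data_tb: float, sstable_size_in_mb: int) -> int:
    """Approximate number of SSTables if each one reaches the target size."""
    return math.ceil(data_tb * MB_PER_TB / sstable_size_in_mb)


for size_mb in (5, 100, 160):
    print(size_mb, estimated_sstable_count(1.5, size_mb))
# 5 314573
# 100 15729
# 160 9831
```

Each SSTable consists of several component files on disk, so the file count in the data directory is higher than the SSTable count.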
