Cassandra - Understanding Java Heap Behavior (depending on internet connection?)

We are running tests on Cassandra 1.2.5 and cannot fully understand the behavior of the Java heap.
The same test behaves differently depending on the location within the company network from which it is started. Is it possible that the internet connection has an effect on Cassandra's Java heap behavior? It seems to us that decreasing the upload speed changes the behavior.
Picture taken after a test of a Java heap overflow on the worst connection (VPN, low upload speed).
Picture taken after a test on the best connection.
We made a third test with an average connection (WLAN) and saw a behavior between the two shown in the pictures.
Our configuration is:
We use Cassandra 1.2.5 and at first didn't change much in the original cassandra.yaml settings, except for:
key_cache_size_in_mb: 0
We just set seeds, listen_address and rpc_address.
cassandra-env.sh was not changed, so Cassandra gets roughly 2 GB of our 7.9 GB of RAM.
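For reference, the non-default part of a cassandra.yaml matching the description above might look roughly like this (the addresses below are placeholders for illustration, not our real values):

```yaml
# cassandra.yaml (Cassandra 1.2.5) - only the settings we touched; addresses are placeholders
key_cache_size_in_mb: 0
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.10"
listen_address: 192.168.1.11
rpc_address: 192.168.1.11
```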
We are using the JDBC driver.
We are just testing with write load.
The test pushes roughly 200 KB per second.
We have no wide rows.
Besides the "pretty normal" inserts, we update a variable on every write.
Any ideas that help us understand this issue would really help us...

Related

Cassandra vs Cassandra+Ignite

(Single Node Cluster) I've got a table with two columns, one of type 'text' and the other a 'blob'. I'm using DataStax's C++ driver to perform read/write requests in Cassandra.
The blob stores a C++ structure (size: 7 KB).
Since I was getting lower-than-desired throughput when using Cassandra alone, I tried adding Ignite on top of Cassandra, hoping for a significant performance improvement now that the data would be read from RAM instead of hard disks.
However, it turned out that after adding Ignite, the performance dropped even more (by roughly 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since I am storing a C++ structure in Cassandra's blob, the Ignite API serializes/deserializes the data on every write/read. Is this the reason for the drop in performance (considering the size of the structure, i.e. 7 KB), or is this drop not expected at all and maybe something is wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
The Ignite configurations are the same as given here.
I got a significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode: the throughput increased from 9,000 rows/second to 23,000 rows/second. But it is still not significantly better than Cassandra alone. I'm still hoping to find some more tweaks that will improve this further.
I've added some more details about the configurations and client code on github.
It looks like you do one get per key in this benchmark for Ignite, and you didn't invoke loadCache before it. In this case, on each get, Ignite will go to Cassandra to fetch the value and only then store it in the cache. So I'd recommend invoking loadCache before benchmarking or, at least, testing gets on the same keys, to give Ignite an opportunity to store the keys in the cache. If you think you already have all the data in the caches, please also share the code where you write data to Ignite.
Also, you invoke grid.GetCache in each thread. It won't take a lot of time, but you should definitely avoid such calls inside the benchmark, where you are already measuring time.
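The effect described above, cold gets falling through to the backing store until the cache is warmed, can be sketched with a stand-in read-through cache (plain Java, no Ignite dependency; the backing map plays the role of Cassandra, and warm() plays the role of loadCache):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for Ignite's read-through behavior over Cassandra: a cold get falls
// through to the slow backing store, then caches the value. Warming the cache
// up front (cf. loadCache) means the timed benchmark never pays that penalty.
public class ReadThroughSketch {
    final Map<String, String> backingStore = new HashMap<>(); // plays "Cassandra"
    final Map<String, String> cache = new HashMap<>();        // plays "Ignite"
    final AtomicInteger backingReads = new AtomicInteger();   // counts slow reads

    String get(String key) {
        return cache.computeIfAbsent(key, k -> {
            backingReads.incrementAndGet(); // cache miss: go to the backing store
            return backingStore.get(k);
        });
    }

    void warm() { // cf. loadCache: preload everything before the timed loop
        cache.putAll(backingStore);
    }

    public static void main(String[] args) {
        ReadThroughSketch cold = new ReadThroughSketch();
        cold.backingStore.put("k1", "v1");
        cold.get("k1");
        System.out.println("cold backing reads: " + cold.backingReads.get());

        ReadThroughSketch warmed = new ReadThroughSketch();
        warmed.backingStore.put("k1", "v1");
        warmed.warm();
        warmed.get("k1");
        System.out.println("warmed backing reads: " + warmed.backingReads.get());
    }
}
```

Hoisting grid.GetCache out of each thread follows the same logic: do setup work once, outside the measured loop.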

Cassandra High client read request latency compared to local read latency

We have a 20-node Cassandra cluster running a lot of read requests (~900k/sec at peak). Our dataset is fairly small, so everything is served directly from memory (OS page cache). Our data model is quite simple (just key/value) and all of our reads are performed at consistency level ONE (RF 3).
We use the Java Datastax driver with TokenAwarePolicy, so all of our reads should go directly to one node that has the requested data.
These are some metrics extracted from one of the nodes regarding client read request latency and local read latency.
org_apache_cassandra_metrics_ClientRequest_50thPercentile{scope="Read",name="Latency",} 105.778
org_apache_cassandra_metrics_ClientRequest_95thPercentile{scope="Read",name="Latency",} 1131.752
org_apache_cassandra_metrics_ClientRequest_99thPercentile{scope="Read",name="Latency",} 3379.391
org_apache_cassandra_metrics_ClientRequest_999thPercentile{scope="Read",name="Latency",} 25109.16
org_apache_cassandra_metrics_Keyspace_50thPercentile{keyspace="<keyspace>",name="ReadLatency",} 61.214
org_apache_cassandra_metrics_Keyspace_95thPercentile{keyspace="<keyspace>",name="ReadLatency",} 126.934
org_apache_cassandra_metrics_Keyspace_99thPercentile{keyspace="<keyspace>",name="ReadLatency",} 182.785
org_apache_cassandra_metrics_Keyspace_999thPercentile{keyspace="<keyspace>",name="ReadLatency",} 454.826
org_apache_cassandra_metrics_Table_50thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 105.778
org_apache_cassandra_metrics_Table_95thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 1131.752
org_apache_cassandra_metrics_Table_99thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 3379.391
org_apache_cassandra_metrics_Table_999thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 25109.16
Another important detail is that most of our queries (~70%) don't return anything, i.e. they are for records that are not found. So bloom filters play an important role here, and they seem to be fine:
Bloom filter false positives: 27574
Bloom filter false ratio: 0.00000
Bloom filter space used:
Bloom filter off heap memory used: 6760992
As can be seen, the reads on each individual node are really fast: the 99.9th percentile is less than 0.5 ms. However, the client request latency is way higher, going above 4 ms at the 99th percentile. If I'm reading with CL ONE and using TokenAwarePolicy, shouldn't both values be similar, since no coordination is required? Am I missing something? Is there anything else I could check to see what's going on?
Thanks in advance.
#luciano
There are various reasons why the coordinator and the replica can report different 99th percentiles for read latencies, even with token awareness configured in the client.
These can be anything that manifests between the coordinator code and the replica's storage-engine code in the read path.
Examples include:
read repairs (not directly related to a particular request, as they are asynchronous to the read that triggered them, but they can cause issues),
host timeouts (and/or speculative retries),
token awareness failures (the dynamic snitch simply not keeping up),
GC pauses.
Look for per-host metric anomalies and overlaps with GC, and even try to capture traces for some of the slower requests and investigate whether they are doing everything you expect from C* (e.g. token awareness).
Even well-tuned and well-spec'd clusters may see the dynamic snitch failing to keep up and do its intended job. In such situations, disabling the dynamic snitch can fix the high latencies in the top-end read percentiles. See https://issues.apache.org/jira/browse/CASSANDRA-6908
Be careful though: measure and confirm hypotheses, as mis-applied solutions can easily have negative effects!
Even when using TokenAwarePolicy, the driver can't apply the policy if it doesn't know what the partition key is.
If you are using simple statements, no routing information is provided, so you need to give the driver additional information by calling setRoutingKey.
The DataStax Java driver's manual is a good friend.
http://docs.datastax.com/en/developer/java-driver/3.1/manual/load_balancing/#requirements
If token awareness is working perfectly, the CoordinatorReadLatency value is mostly the same as the ReadLatency value. You should check that too.
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=coordinatorreadlatency
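Why a missing routing key shows up as extra coordinator latency can be seen in a toy model (plain Java; the modulo "ring" below is only a stand-in for the real Murmur3 token ring, and this is an illustration, not driver code):

```java
import java.util.List;

// Toy model: a read lands on some coordinator; if that coordinator does not own
// the partition, it must forward the read to the owning replica (an extra
// network hop), which inflates client-observed latency. Token-aware routing
// with a known routing key sends the request straight to the owner.
public class TokenAwareSketch {
    static final List<String> RING = List.of("node-0", "node-1", "node-2");

    // Stand-in for the token ring: which node owns a given partition key.
    static String owner(String partitionKey) {
        return RING.get(Math.floorMod(partitionKey.hashCode(), RING.size()));
    }

    // Extra hops a read costs once it reaches the chosen coordinator.
    static int extraHops(String partitionKey, String coordinator) {
        return coordinator.equals(owner(partitionKey)) ? 0 : 1;
    }

    public static void main(String[] args) {
        String key = "user:42";
        // With a routing key the driver can pick the owner as coordinator: no extra hop.
        System.out.println("owner as coordinator: " + extraHops(key, owner(key)) + " extra hops");
        // Without one, an arbitrary coordinator may have to forward the read.
        System.out.println("arbitrary coordinator: up to 1 extra hop");
    }
}
```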
Thanks for your reply, and sorry about the delay in getting back to you.
One thing I’ve found out is that our clusters had:
dynamic_snitch_badness_threshold=0
in the config files. Changing that to the default value (0.1) helped a lot in terms of the client request latency.
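In cassandra.yaml terms, the fix amounts to one line (0.1 is the shipped default; as far as I understand, a threshold of 0 lets the dynamic snitch reroute requests away from the preferred token-aware replica on any tiny score difference):

```yaml
# cassandra.yaml - 0.1 is the default; we had it set to 0
dynamic_snitch_badness_threshold: 0.1
```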
The GC seems to be stable, even under high load. The pauses are constant (~10ms / sec) and I haven’t seen spikes (not even full gcs). We’re using CMS with a bigger Xmn (2.5GB).
Read repairs happen all the time (we have the chance set to 10%), so when the system is handling 800k req/sec, we have ~80k read repairs/sec happening in the background.
It also seems that we're asking too much of a 20-machine cluster. From the client's point of view, latency is quite stable up to 800k qps; after that it starts to spike a little, but it is still under a manageable threshold.
Thanks for all the tips, the dynamic snitch thing was really helpful!

What's the nature of Cassandra "write timeout"?

I am running a write-heavy program (10 threads peaking at 25K writes/sec) on a 24-node Cassandra 3.5 cluster on AWS EC2 (each host is of the c4.2xlarge type: 8 vcores and 15 GB RAM).
Every once in a while my Java client, using DataStax driver 3.0.2, would get write timeout issue:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency TWO (2 replica were required but only 1 acknowledged the write)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:73)
at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
The error happens infrequently and very unpredictably. So far I have not been able to link the failures to anything specific (e.g. program running time, data size on disk, time of day, or indicators of system load such as CPU, memory, or network metrics). Nonetheless, it is really disrupting our operations.
I am trying to find the root cause of the issue. Looking online for options, I am a bit overwhelmed by all the leads out there, such as:
Changing "write_request_timeout_in_ms" in "cassandra.yaml" (already changed to 5 seconds)
Using a proper "RetryPolicy" to keep the session going (already using DowngradingConsistencyRetryPolicy with ONE session-level consistency)
Changing cache size, heap size, etc.; I never tried those because there are good reasons to discount them as the root cause.
One thing that is really confusing in my research is that I am getting this error from a fully replicated cluster with very few ClientRequest.timeout.write events:
I have a fully replicated 24-node cluster spanning 5 AWS regions. Each region has at least 2 copies of the data.
My program uses consistency level ONE at the Session level (Cluster builder with QueryOptions).
When the error happened, our Graphite chart registered no more than three (3) host hiccups, i.e. hosts registering Cassandra.ClientRequest.Write.Timeouts.Count values.
I already set the write timeout to 5 seconds. The network is pretty fast (verified with iperf3) and stable.
On paper, the situation should be well within Cassandra's failsafe range. So why does my program still fail? Are the numbers not what they appear to be?
It's not always necessarily a bad thing to see timeouts or errors, especially if you're writing at a higher consistency level; the writes may still get through.
I see you mention CL=ONE. You could still get timeouts here, but the write (mutation) may still have got through. I found this blog really useful: https://www.datastax.com/dev/blog/cassandra-error-handling-done-right. Check your server-side (node) logs at the time of the error to see if you have things like ERROR / WARN messages or GC pauses (as one of the comments mentions above); these kinds of events can make a node unresponsive and therefore cause a timeout or another type of error.
If your updates are idempotent (ideally they are), then you can build in some retry mechanism.
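A retry mechanism for idempotent writes can be sketched generically (plain Java, no driver dependency; in real code the operation would be the session.execute call and the caught exception would be WriteTimeoutException):

```java
import java.util.concurrent.Callable;

// Generic bounded retry with exponential backoff for an idempotent operation.
// Safe only because re-applying the same write yields the same end state.
public class RetrySketch {
    static <T> T withRetries(Callable<T> op, int maxAttempts, long baseBackoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {          // e.g. WriteTimeoutException
                last = e;
                if (attempt < maxAttempts) {
                    // Exponential backoff: base, 2*base, 4*base, ...
                    Thread.sleep(baseBackoffMs << (attempt - 1));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds: the retry loop absorbs the transient errors.
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("simulated timeout");
            return "ok";
        }, 5, 1);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```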

Datastax Cassandra repair service weird estimation and heavy load

I have a 5-node cluster with around 1 TB of data and vnodes enabled, running OpsCenter 5.12 and DSE 4.6.7. I would like to do a full repair within 10 days, using the Repair Service in OpsCenter so that I don't put unnecessary load on the cluster.
The problem I'm facing is that the Repair Service puts too much load on the cluster and works too fast. Its progress is around 30% (according to OpsCenter) in 24 h. I even tried changing the time to 40 days, without any difference.
Questions:
Can I trust the percent-complete number in OpsCenter?
The suggested number is something like 0.000006 days. Could that guess be related to the problem?
Are there any settings/tweaks that could help lower the load?
You can use OpsCenter as a guideline about where data is stored and what's going on in the cluster, but it's really more of a dashboard. The real 'tale of the tape' comes from nodetool, run from the command line on the server nodes, such as:
#shell> nodetool status
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns   Host ID                               Rack
UN  10.xxx.xxx.xx  43.95 GB  256     33.3%  b1e56789-8a5f-48b0-9b76-e0ed451754d4  RAC1
What type of compaction are you using?
You've asked a sort of 'magic bullet' question, as there could be several factors in play. These are examples, but the list is not exhaustive:
A. The size of the data and of whole rows in Cassandra (you can see these in the nodetool cfstats table-size entries). Rows with a binary size larger than 16 MB will be seen as "ultra-wide" rows, which might be an indicator that your data model needs a 'compound' or 'composite' row key.
B. The type of setup you have with respect to replication and network strategy.
C. The data entry point, i.e. how Cassandra gets its data. Are you using Python? PHP? What inputs the data? You can get funky behavior from a cluster with a bad PHP driver (for example).
D. Vnodes are good, but can be bad. What version of Cassandra are you running? You can find out via CQLSH by running cqlsh -3 and then typing 'show version'.
E. The type of compaction is a big factor. Are you using SizeTieredCompactionStrategy or LeveledCompactionStrategy?
Start by running 'nodetool cfstats' from the command line on the server any given node runs on. The areas of particular interest (at this point) would be:
Compacted row minimum size:
Compacted row maximum size:
More than X bytes here on systems with Y amount of RAM can be a significant problem. Make sure Cassandra has enough RAM and that the stack is tuned.
The default performance configuration for Cassandra should normally be enough, so the next step would be to open a CQLSH interface to the node with 'cqlsh -3 hostname' and issue the command 'describe keyspaces'. Take the keyspace name you are running and issue 'describe keyspace FOO', then look at your schema. Of particular interest are your primary keys. Are you using "composite row keys" or a "composite primary key" (as described here: http://www.datastax.com/dev/blog/whats-new-in-cql-3-0)? If not, you probably need to, depending on the expected read/write load.
Also check how your application layer is inserting data into Cassandra. Using PHP? Python? What drivers are being used? There are significant bugs in Cassandra versions < 1.2.10 with certain Thrift connectors, such as the Java driver or the phpcassa driver, so you might need to upgrade Cassandra and make some driver changes.
In addition to these steps also consider how your nodes were created.
Note that migration from static nodes to virtual nodes (vnodes) has to be mitigated; you can't simply switch configs on a node that has already been populated. You will want to check your initial_token: settings in /etc/cassandra/cassandra.yaml. The questions I ask myself here are: what initial tokens are set (there should be no initial tokens for vnodes)? Were the tokens changed after the data was populated? For the static nodes I typically run, I calculate the tokens using a tool like http://www.geroba.com/cassandra/cassandra-token-calculator/, as I've run into complications with vnodes (though they are much more reliable now than before).

cassandra java heap space issues with phpcassa

We are currently doing some stress tests with the ab tool. Single inserts are doing fine in Cassandra. However, when it comes to batch inserts, I'm getting a Java out-of-memory error: Java heap space.
I have a VirtualBox machine with Ubuntu Server 13.04 installed on it and 2 GB of memory.
I don't know much about Cassandra's internal configuration.
I'm just making a batch insert of size 100 (100 inserts in one BATCH).
After I see this error, I no longer have cqlsh or nodetool access for almost an hour.
How can I fix this error under heavy load?
NOTE: It doesn't happen on single inserts with HTTP POST requests.
NOTE: In my column family, I have a key of type TimeUUIDType and the column values are ints and varchars.
UPDATE: Test results show that I didn't have any problems before 6,000 requests. However, at around 7,000, the PHP code throws the following:
Error connecting to 127.0.0.1: Thrift\Exception\TTransportException: TSocket: timed out reading 4 bytes from 127.0.0.1:9160
Moreover, Cassandra logs the following under heavy load:
WARN [ScheduledTasks:1] 2013-06-28 03:43:07,931 GCInspector.java (line 142)
Heap is 0.9231763795560355 full. You may need to reduce memtable and/or cache sizes.
Cassandra will now flush up to the two largest memtables to free up memory. Adjust
flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to
do this automatically
The batch doesn't sound like a large enough dataset to cause the memory problem, so this sounds like a problem with the JVM on the virtual machine. How much memory have you allocated to it?
You can check by starting JConsole (just type jconsole in the terminal / prompt) and viewing the 'Memory' tab, specifically the value under Max:
You can also get some solid details about what caused the crash thanks to the -XX:+HeapDumpOnOutOfMemoryError parameter included in C*'s startup script; it produces a heap dump file capturing the state of the heap at the moment of the memory problem.
Typically the heap size is calculated automatically by the calculate_heap_sizes() function in cassandra-env.sh. You can, however, override the number that function generates by setting MAX_HEAP_SIZE to a different value. The same variable is used on lines 174 and 175 of cassandra-env.sh (JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}") for setting the min and max heap size.
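The rule calculate_heap_sizes() applies in the 1.2 branch is, roughly, max(min(half the RAM, 1 GB), min(a quarter of the RAM, 8 GB)). A sketch of that arithmetic (paraphrased from the shell script, so treat it as approximate):

```java
// Sketch of the automatic heap sizing done by calculate_heap_sizes() in
// cassandra-env.sh (Cassandra 1.2 branch):
//   max(min(RAM/2, 1024 MB), min(RAM/4, 8192 MB))
// Paraphrased from the shell script; treat as approximate.
public class HeapSizeSketch {
    static long maxHeapMb(long systemMemoryMb) {
        long half = Math.min(systemMemoryMb / 2, 1024);
        long quarter = Math.min(systemMemoryMb / 4, 8192);
        return Math.max(half, quarter);
    }

    public static void main(String[] args) {
        // The 2 GB VirtualBox machine from the question gets ~1 GB of heap,
        // which a burst of 100-statement batches can exhaust quickly.
        System.out.println(maxHeapMb(2048) + " MB");
    }
}
```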
