Cassandra latency patterns under constant load - cassandra

I've got pretty unusual latency patterns in my production setup:
the whole cluster (3 machines: 48 gig ram, 7500 rpm disk, 6 cores) shows latency spikes every 10 minutes, all machines at the same time.
See this screenshot.
I checked the logfiles and it seems as there are no compactions taking place at that time.
I've got 2k reads and 5k reads/sec. No optimizations have been made so far.
Caching is set to "ALL", hit rate for row cache is at ~0,7.
Any ideas? Is tuning memtable size an option?
Best,
Tobias

Related

Why is counter cache not being utilized after I increase the size?

My Cassandra application entails primarily counter writes and reads. As such, having a counter cache is important to performance. I increased the counter cache size in cassandra.yaml from 1000 to 3500 and did a cassandra service restart. The results were not what I expected. Disk use went way up, throughput went way down and it appears the counter cache is not being utilized at all based on what I'm seeing in nodetool info (see below). It's been almost two hours now and performance is still very bad.
I saw this same pattern yesterday when I increased the counter cache from 0 to 1000. It went quite awhile without using the counter cache at all and then for some reason it started using it. My question is whether there is something I need to do to activate counter cache utilization?
Here are my settings in cassandra.yaml for the counter cache:
counter_cache_size_in_mb: 3500
counter_cache_save_period: 7200
counter_cache_keys_to_save: (currently left unset)
Here's what I get out of nodetool info after about 90 minutes:
Gossip active : true
Thrift active : false
Native Transport active: false
Load : 1.64 TiB
Generation No : 1559914322
Uptime (seconds) : 6869
Heap Memory (MB) : 15796.00 / 20480.00
Off Heap Memory (MB) : 1265.64
Data Center : WDC07
Rack : R10
Exceptions : 0
Key Cache : entries 1345871, size 1.79 GiB, capacity 1.95 GiB, 67936405 hits, 83407954 requests, 0.815 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 5294462, size 778.34 MiB, capacity 3.42 GiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 24064, size 1.47 GiB, capacity 1.47 GiB, 65602315 misses, 83689310 requests, 0.216 recent hit rate, 3968.677 microseconds miss latency
Percent Repaired : 8.561186035383143%
Token : (invoke with -T/--tokens to see all 256 tokens)
Here's a nodetool info on the Counter Cache prior to increasing the size:
Counter Cache : entries 6802239, size 1000 MiB, capacity 1000 MiB,
57154988 hits, 435820358 requests, 0.131 recent hit rate,
7200 save period in seconds
Update:
I've been running for several days now trying various values of the counter cache size on various nodes. It is consistent that the counter cache isn't enabled until it reaches capacity. That's just how it works as far as I can tell. If anybody knows a way to enable the cache before it is full let me know. I'm setting it very high because it seems optimal but that means that the cache is down for several hours while it fills up and while it's down my disks are absolutely maxed out with read requests...
Another update:
Further running shows that occasionally the counter cache does kick in before it fills up. I really don't know why that is. I don't see a pattern yet. I would love to know the criteria for when this does and does not work.
One last update:
While the counter cache is filling up native transport is disabled for the node as well. Setting the counter to 3.5 GB I'm now going 24 hours with the node in this low performance state with native transport disabled.
I have found out a way to 100% of the time avoid the counter cache not being enabled and native transport mode disabled. This approach avoids the serious performance problems I encountered waiting for the counter cache to enable (sometimes for hours in my case since I want a large counter cache):
1. Prior to starting Cassandra, set cassandra.yaml file field counter_cache_size_in_mb to 0
2. After starting cassandra and getting it up and running use node tool commands to set the cache sizes:
Example command:
nodetool setcachecapacity 2000 0 1000
In this example, the first value of 2000 sets the key cache size, the second value of 0 is the row cache size and the third value of 1000 is the counter cache size.
Take measurements and decide if those are the optimal values. If not, you can repeat step two without restarting Cassandra with new values as needed
Further details:
Some things that don't work:
Setting the counter_cache_size_in_mb value if the counter cache is not yet enabled. This is the case where you started Cassandra with a non-zero value in counter_cache_size_in_mb in Cassandra.yaml and you have not yet reached that size threshold. If you do this, the counter cache will never enabled. Just don't do this. I would call this a defect but it is the way things currently work.
Testing that I did:
I tested this on five separate nodes multiple times with multiple values. Both initially when Cassandra is just coming up and after some period of time. This method I have described worked in every case. I guess I should have saved some screenshots of nodetool info to show results.
One last thing: If Cassandra developers are watching could they please consider tweaking the code so that this workaround isn't necessary?

YCSB low read throughput cassandra

The YCSB Endpoint benchmark would have you believe that Cassandra is the golden child of Nosql databases. However, recreating the results on our own boxes (8 cores with hyperthreading, 60 GB memory, 2 500 GB SSD), we are having dismal read throughput for workload b (read mostly, aka 95% read, 5% update).
The cassandra.yaml settings are exactly the same as the Endpoint settings, barring the different ip addresses, and our disk configuration (1 SSD for data, 1 for a commit log). While their throughput is ~38,000 operations per second, ours is ~16,000 regardless (relatively) of the threads/number of client nodes. I.e. one worker node with 256 threads will report ~16,000 ops/sec, while 4 nodes will each report ~4,000 ops/sec
I've set the readahead value to 8KB for the SSD data drive. I'll put the custom workload file below.
When analyzing disk io & cpu usage with iostat, it seems that the reading throughput is consistently ~200,000 KB/s, which seems to suggest that the ycsb cluster throughput should be higher (records are 100 bytes). ~25-30% of cpu seems to be under %iowait, 10-25% in use by the user.
top and nload stats are not ostensibly bottlenecked (<50% memory usage, and 10-50 Mbits/sec for a 10 Gb/s link).
# The name of the workload class to use
workload=com.yahoo.ycsb.workloads.CoreWorkload
# There is no default setting for recordcount but it is
# required to be set.
# The number of records in the table to be inserted in
# the load phase or the number of records already in the
# table before the run phase.
recordcount=2000000000
# There is no default setting for operationcount but it is
# required to be set.
# The number of operations to use during the run phase.
operationcount=9000000
# The offset of the first insertion
insertstart=0
insertcount=500000000
core_workload_insertion_retry_limit = 10
core_workload_insertion_retry_interval = 1
# The number of fields in a record
fieldcount=10
# The size of each field (in bytes)
fieldlength=10
# Should read all fields
readallfields=true
# Should write all fields on update
writeallfields=false
fieldlengthdistribution=constant
readproportion=0.95
updateproportion=0.05
insertproportion=0
readmodifywriteproportion=0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=usertable
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
The key was increasing native_transport_max_threads in the casssandra.yaml file.
Along with the increased settings in the comment (increasing connections in ycsb client as well as concurrent read/writes in cassandra), Cassandra jumped to ~80,000 ops/sec.

Cassandra: Unusual High CPU Util at same moment (everyday)

We are using 3-nodes Cassandra Cluster in production from 4-5 months, We haven't faced any major issue, however, every day at 2200 hours (UTC), we see a spike in CPU Utilization (see the attached graph).
I have confirmed there is no major load from our end at that time. Also, most of the time CPU Util is < 1%, (< 15% while querying and few jobs). We take snapshots at a different time, nowhere close to 2200 hours.
There is nothing unusual in system.log and debug.log.
One more important thing is, it happens only for one node. All of them reache their highest CPU Util, but one of the node is always thrice as much as others.
(Orange colors are first node. All the peaks are at 2200 hours)
Graph for all 3 nodes.
Graph for 2 nodes, except first one (orange). As per #PhillipBlum asked.

High disk I/O on Cassandra nodes

Setup:
We have 3 nodes Cassandra cluster having data of around 850G on each node, we have LVM setup for Cassandra data directory (currently consisting 3 drives 800G + 100G + 100G) and have separate volume (non LVM) for cassandra_logs
Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1
Issue:
After adding 3rd (100G) volume in LVM on each of the node, all the nodes went very high in disk I/O and they go down quite often, servers also become inaccessible and we need to reboot the servers, servers don't get stable and we need to reboot after every 10 - 15 mins.
Other Info:
We have DSE recommended server settings (vm.max_map_count, file descriptor) configured on all nodes
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:
Pool Name Active Pending Completed Blocked All time blocked
FlushWriter 0 0 22 0 8
FlushWriter 0 0 80 0 6
FlushWriter 0 0 38 0 9
The column I'm concerned about is the All Time Blocked. As a ratio to completed, you have a lot of blocking. The flushwriter is responsible for flushing memtables to the disk to keep the JVM from running out of memory or creating massive GC problems. The memtable is an in-memory representation of your tables. As your nodes take more writes, they start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.
Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge sorted results. More sequential IO.
So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over on sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium term is to find some ways to optimize your stack for sequential disk loads, but that will eventually fail. Long term is get your data on real disks and off shared storage.
I have told this to consulting clients for years when using Cassandra "If your storage has an ethernet plug, you are doing it wrong" Good rule of thumb.

Solr Indexing Time

Solr 1.4 is doing great with respect to Indexing on a dedicated physical server (Windows Server 2008). For Indexing around 1 million full text documents (around 4 GB size) it takes around 20 minutes with Heap Size = 512M - 1G & 4GB RAM.
However while using Solr on a VM, with 4 GB RAM it took 50 minutes to index at the first time. Note that there is no Network delays and no RAM issues. Now when I increased the RAM to 8GB and increased the heap size, the indexing time increased to 2 hrs. That was really strange. Note that except for SQL Server there is no other process running. There are no network delays. However I have not checked for File I/O. Can that be a bottleneck? Does Solr has any issues running in "Virtualization" Environment?
I read a paper today by Brian & Harry: "ON THE RESPONSE TIME OF A SOLR SEARCH ENGINE IN A VIRTUALIZED ENVIRONMENT" & they claim that performance gets deteriorated when RAM is increased when Solr is running on a VM but that is with respect to query times and not indexing times.
I am bit confused as to why it took longer on a VM when I repeated the same test second time with increased heap size and RAM.
I/O on a VM will always be slower than on dedicated hardware. This is because the disk is virtualized and I/O operations must pass through an extra abstraction layer. Indexing requires intensive I/O operations, so it's not surprising that it runs more slowly on a VM. I don't know why adding RAM causes a slowdown though.

Resources