Why is the load different on a 3 node cluster with RF 3? - cassandra

I have a 3 node Cassandra cluster with a replication factor of 3.
This means that all data should be replicated onto all 3 nodes.
The following is the output of nodetool status:
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.1 27.66 GB 256 100.0% 2e89198f-bc7d-4efd-bf62-9759fd1d4acc RAC1
UN 192.168.0.2 28.77 GB 256 100.0% db5fd62d-3381-42fa-84b5-7cb12f3f946b RAC1
UN 192.168.0.3 27.08 GB 256 100.0% 1ffb4798-44d4-458b-a4a8-a8898e0152a2 RAC1
This is a graph of disk usage over time on all 3 of the nodes:
My question is why do these sizes vary so much? Is it that compaction hasn't run at the same time?

I would say several factors could play a role here.
As you note, compaction will not run at the same time, so the number and contents of the SSTables will be somewhat different on each node.
The memtables will also not have been flushed to SSTables at the same time either, so right from the start, each node will have somewhat different SSTables.
If you're using compression for the SSTables, given that their contents are somewhat different, the amount of space saved by compressing the data will vary somewhat.
And even though you are using a replication factor of three, I would imagine the storage space for non-primary range data is slightly different than the storage space for primary range data, and it's likely that more primary range data is being mapped to one node or the other.
So basically, unless each node saw the exact same sequence of messages at exactly the same time, the nodes won't have exactly the same data size on disk.
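If you want to see these effects on your own cluster, a couple of standard nodetool commands make the per-node differences visible; the keyspace and table names below are placeholders:
# Flush memtables so the on-disk sizes are comparable across nodes
nodetool flush my_keyspace my_table
# Compare "SSTable count" and "Space used (live)" between nodes
nodetool cfstats my_keyspace.my_table
# Check whether compactions are still running on this node
nodetool compactionstats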

Related

I added nodes (10 nodes) but the cassandra-stress result is slower than on a single node?

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.170.128 317.66 MiB 256 62.4% 45e953bd-5cca-44d9-ba26-99e0db28398d rack1
UN 192.168.170.129 527.05 MiB 256 60.2% e0d2faec-9714-49cf-af71-bfe2f2fb0783 rack1
UN 192.168.170.130 669.08 MiB 256 60.6% eaa1e39b-2256-4821-bbc8-39e47debf5e8 rack1
UN 192.168.170.132 537.11 MiB 256 60.0% 126e151f-92bc-4197-8007-247e385be0a6 rack1
UN 192.168.170.133 417.6 MiB 256 56.8% 2eb9dd83-ab44-456c-be69-6cead1b5d1fd rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.170.136 386.12 MiB 256 41.0% 2e57fac6-95db-4dc3-88f7-936cd8038cac rack1
UN 192.168.170.137 518.74 MiB 256 40.9% b6d61651-7c65-4ac9-a5b3-053c77cfbd37 rack1
UN 192.168.170.138 554.43 MiB 256 38.6% f1ba3e80-5dac-4a22-9025-85e868685de5 rack1
UN 192.168.170.134 153.76 MiB 256 40.7% 568389b3-304b-4d8f-ae71-58eb2a55601c rack1
UN 192.168.170.135 350.76 MiB 256 38.7% 1a7d557b-8270-4181-957b-98f6e2945fd8 rack1
CREATE KEYSPACE grudb WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3', 'dc2': '2'} AND durable_writes = true;
That's my setting.
The consistency level (CL) is ONE.
In general, a 10-node cluster can sustain higher throughput, but whether or not this actually translates into higher cassandra-stress scores depends on what exactly you're doing:
First, you need to ensure that the cassandra-stress client is not your bottleneck. For example, if the machine running cassandra-stress is at 100% CPU or network utilization, you will never get a better score even if you have 100 server nodes.
Second, you need to ensure that cassandra-stress's concurrency is high enough. In the extreme case, if cassandra-stress sends just one request after another, all you're doing is measuring latency, not throughput. Moreover, it doesn't help to have 100 nodes if you only send one request at a time to one of them. So please try increasing cassandra-stress's concurrency to see if that makes any difference.
Now that we've gotten the potential cassandra-stress issues out of the way, let's look at the server. You didn't simply increase your cluster from 1 node to 10 nodes. If you had just done that, you'd rightly be surprised if performance didn't increase. But you did something else: you increased to 10 nodes, but also greatly increased the work each write requires - in your setup each write needs to go to 5 nodes (!), 3 in one DC and 2 in the other (those are the RFs you configured). So even in the best case, you can't expect write throughput on this cluster to be more than twice that of a single node. Actually, because of all the overhead of this replication, you should expect even less than twice the performance - so seeing similar performance is not surprising.
The above estimate was for write performance. For read performance, since you said you're using CL=ONE (you can use CL=LOCAL_ONE, by the way), read throughput should indeed scale linearly with the cluster's size. If it does not, I am guessing you have a problem with the setup like I described above (client bottlenecked or using too little concurrency).
Please try to run read and write benchmarks separately to better understand which of them is the main scalability problem.
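As a rough illustration (not from the original post), separate write and read runs with explicit client concurrency could look like the following; the node address, operation counts, and thread counts are placeholders, and flag details vary between cassandra-stress versions:
# Write-only benchmark with a larger client thread pool
cassandra-stress write n=1000000 cl=ONE -rate threads=200 -node 192.168.170.128
# Read-only benchmark against the data written above
cassandra-stress read n=1000000 cl=ONE -rate threads=200 -node 192.168.170.128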

Now getting error "message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}"

Cassandra version: dsc-cassandra-2.1.9
Had 3 nodes, one of which was down for a long time. Brought it back up and decommissioned it. Then did a nodetool removenode.
When I try to make a cql query I get the above error.
Initially I thought this might be because the replication strategy was SimpleStrategy, so I ran ALTER KEYSPACE history WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'dc1' : 2};
changed endpoint_snitch to GossipingPropertyFileSnitch instead of SimpleSnitch,
did a nodetool repair on both nodes, and restarted the Cassandra services.
But the problem is still there. What do I do?
EDIT 1: Nodetool status of machine A
-- Address Load Tokens Owns Host ID Rack
UN 192.168.99.xxx 19.8 GB 256 ? xxxxxxxx-xxxx-xxx-xxxx-xxxxx4ea RAC1
UN 192.168.99.xxx 18.79 GB 256 ? xxxxxxxx-xxxx-xxx-xxxx-xxxxxx15 RAC1
nodetool status output of machine B
-- Address Load Tokens Owns Host ID Rack
UN 192.168.99.xxx 19.8 GB 256 ? xxxxxxxx-xxxx-xxx-xxxx-xxxxxxxx4ea RAC1
UN 192.168.99.xxx 18.79 GB 256 ? xxxxxxxx-xxxx-xxx-xxxx-xxxxxxxxf15 RAC1
What is weird is that under the Owns column you have no %, only a ?. This same issue occurred to me in the past when I bootstrapped a new C* cluster using SimpleStrategy and SimpleSnitch. Like you, I did an ALTER KEYSPACE to switch to NetworkTopologyStrategy and GossipingPropertyFileSnitch, but it did not solve my issue, so I rebuilt the cluster from scratch (fortunately I had no data inside).
If you have a backup of your data somewhere, just try to rebuild the 2 nodes from scratch.
Otherwise, consider backing up your SSTable files on one node, rebuilding the cluster, and putting the SSTables back. Be careful, because some file/folder renaming may be necessary.
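If you go the backup-and-rebuild route, one possible workflow (sketched with the keyspace from your question and a placeholder table name; exact paths depend on your data directory layout) is:
# Snapshot the keyspace before tearing anything down
nodetool snapshot -t pre_rebuild history
# Copy each table's snapshots/pre_rebuild/ directory somewhere safe.
# After rebuilding, copy the SSTable files back into the new table's data
# directory and ask Cassandra to load them:
nodetool refresh history my_table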

Reshuffle data evenly across Cassandra ring

I have a three-node ring of Apache Cassandra 2.1.12. I inserted some data when it was a 2-node ring and then added one more node, 172.16.5.54, to the ring. I am using vnodes in my ring. The problem is that the data is not distributed evenly, whereas ownership seems distributed evenly. So, how do I redistribute the data across the ring? I have tried nodetool repair and nodetool cleanup, but still no luck.
Moreover, what do the Load and Owns columns signify in the nodetool status output?
Also, if I import data from a file on one of these three nodes, CPU utilization goes up to 100%, and eventually the data on the other two nodes gets distributed evenly, but not on the node running the import. Why is that?
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.16.5.54 1.47 MB 256 67.4% 40d07f44-eef8-46bf-9813-4155ba753370 rack1
UN 172.16.4.196 165.65 MB 256 68.3% 6315bbad-e306-4332-803c-6f2d5b658586 rack1
UN 172.16.3.172 64.69 MB 256 64.4% 26e773ea-f478-49f6-92a5-1d07ae6c0f69 rack1
The columns in the output are explained for cassandra 2.1.x in this doc. The load is the amount of file system data in the cassandra data directories. It seems unbalanced across your 3 nodes, which might imply that your partition keys are clustering on a single node (172.16.4.196), sometimes called a hot spot.
The Owns column is "the percentage of the data owned by the node per datacenter times the replication factor." So I can deduce your RF=2 because each node Owns roughly 2/3 of the data.
You need to fix the partition keys of your tables.
Cassandra distributes data to nodes based on the partition key (using hash partitioning over token ranges).
So, for some reason, you have a lot of data for a few partition key values, and almost none for the rest of the partition key values.
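To confirm that a hot partition is the cause, you can look at the partition size statistics on the node holding most of the data; the keyspace and table names below are placeholders:
# Compacted partition min/mean/max sizes for the table
nodetool cfstats my_keyspace.my_table
# Percentile distribution of partition sizes and cell counts (Cassandra 2.1 syntax)
nodetool cfhistograms my_keyspace my_table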

Cassandra compaction strategy

We have a big Cassandra cluster: 22 servers (each holding close to 20 TB of data).
We recently changed the compaction strategy from SizeTieredCompactionStrategy (STCS) to DateTieredCompactionStrategy (DTCS).
We store many binary files in our cluster. After some time we altered a lot of that data. Then, to free space, we launched compaction.
After compaction ended, almost all servers freed space, but nodetool status
shows that 2 servers still hold a lot of duplicated data.
-- Address Load Tokens Owns
UN 1.1.1.1 19.99 TB 256 4.7%
UN 1.1.1.2 18.94 TB 256 4.4%
UN 1.1.1.3 19.55 TB 256 4.5%
UN 1.1.1.4 28.24 TB 256 4.8%
UN 1.1.1.5 23.95 TB 256 4.7%
For all data we use gc_grace_seconds=0 AND
So I started compaction once more on these 2 servers (to no avail).
nodetool compactionhistory for the biggest table looks like this:
data data1 1441346309116 7694331659 7694326967 {1:25608, 2:138}
It looks like compaction didn't drop any of the altered data.
Can there really be such a large difference in data size between servers? Or is this some problem related to the strategy change?
Big thanks for your help.

Determining how full a Cassandra cluster is

I just imported a lot of data in a 9 node Cassandra cluster and before I create a new ColumnFamily with even more data, I'd like to be able to determine how full my cluster currently is (in terms of memory usage). I'm not too sure what I need to look at. I don't want to import another 20-30GB of data and realize I should have added 5-6 more nodes.
In short, I have no idea if I have too few/many nodes right now for what's in the cluster.
Any help would be greatly appreciated :)
$ nodetool -h 192.168.1.87 ring
Address DC Rack Status State Load Owns Token
151236607520417094872610936636341427313
192.168.1.87 datacenter1 rack1 Up Normal 7.19 GB 11.11% 0
192.168.1.86 datacenter1 rack1 Up Normal 7.18 GB 11.11% 18904575940052136859076367079542678414
192.168.1.88 datacenter1 rack1 Up Normal 7.23 GB 11.11% 37809151880104273718152734159085356828
192.168.1.84 datacenter1 rack1 Up Normal 4.2 GB 11.11% 56713727820156410577229101238628035242
192.168.1.85 datacenter1 rack1 Up Normal 4.25 GB 11.11% 75618303760208547436305468318170713656
192.168.1.82 datacenter1 rack1 Up Normal 4.1 GB 11.11% 94522879700260684295381835397713392071
192.168.1.89 datacenter1 rack1 Up Normal 4.83 GB 11.11% 113427455640312821154458202477256070485
192.168.1.51 datacenter1 rack1 Up Normal 2.24 GB 11.11% 132332031580364958013534569556798748899
192.168.1.25 datacenter1 rack1 Up Normal 3.06 GB 11.11% 151236607520417094872610936636341427313
-
# nodetool -h 192.168.1.87 cfstats
Keyspace: stats
Read Count: 232
Read Latency: 39.191931034482764 ms.
Write Count: 160678758
Write Latency: 0.0492021849459404 ms.
Pending Tasks: 0
Column Family: DailyStats
SSTable count: 5267
Space used (live): 7710048931
Space used (total): 7710048931
Number of Keys (estimate): 10701952
Memtable Columns Count: 4401
Memtable Data Size: 23384563
Memtable Switch Count: 14368
Read Count: 232
Read Latency: 29.047 ms.
Write Count: 160678813
Write Latency: 0.053 ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 115533264
Key cache capacity: 200000
Key cache size: 1894
Key cache hit rate: 0.627906976744186
Row cache: disabled
Compacted row minimum size: 216
Compacted row maximum size: 42510
Compacted row mean size: 3453
-
[default#stats] describe;
Keyspace: stats:
Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Durable Writes: true
Options: [replication_factor:3]
Column Families:
ColumnFamily: DailyStats (Super)
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
Row cache size / save period in seconds / keys to save : 0.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 200000.0/14400
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
Replicate on write: true
Built indexes: []
Column Metadata:
(removed)
Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
Obviously, there are two types of memory -- disk and RAM. I'm going to assume you're talking about disk space.
First, you should find out how much space you're currently using per node. Check the on-disk usage of the Cassandra data dir (by default /var/lib/cassandra/data) with this command:
du -ch /var/lib/cassandra/data
You should then compare that to the size of your disk, which can be found with df -h. Only consider the df entry for the disk your Cassandra data is on, by checking the "Mounted on" column.
Using those stats, you should be able to calculate how full (in %) the Cassandra data partition is. Generally you don't want to get too close to 100%, because Cassandra's normal compaction processes temporarily use more disk space. If you don't have enough headroom, a node can get caught with a full disk, which can be painful to resolve (as a side note, I occasionally keep a "ballast" file of a few gigs that I can delete just in case I need to free some extra space). I've generally found that not exceeding about 70% disk usage is on the safe side for the 0.8 series.
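A minimal version of that check, assuming the default data directory and that it lives on its own partition:
# Total on-disk size of the Cassandra data directory
du -sh /var/lib/cassandra/data
# Size, used space, and Use% of the filesystem that directory is on
df -h /var/lib/cassandra/data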
If you're using a newer version of Cassandra, I'd recommend giving the Leveled Compaction strategy a shot to reduce temporary disk usage. Instead of potentially using twice as much disk space, the new strategy will temporarily use at most about 10x a small, fixed SSTable size (5 MB by default).
You can read more about how compaction temporarily increases disk usage on this excellent blog post from Datastax: http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra It also explains the compaction strategies.
So to do a little capacity planning, you can figure out how much more space you'll need. With a replication factor of 3 (what you're using above), adding 20-30GB of raw data would add 60-90GB after replication. Split between your 9 nodes, that's roughly 7-10GB more per node. Does adding that kind of disk usage per node push you too close to having full disks? If so, you might want to consider adding more nodes to the cluster.
One other note is that your nodes' loads aren't very even -- from 2GB up to 7GB. If you're using the ByteOrderPartitioner instead of the random one, that can cause uneven load and "hotspots" in your ring; you should consider using the RandomPartitioner if possible. The other possibility is that you have extra data hanging around that needs to be taken care of (Hinted Handoffs and snapshots come to mind). Consider cleaning that up by running nodetool repair and nodetool cleanup on each node, one at a time (be sure to read up on what those do first!).
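A minimal version of that cleanup pass, run against one node at a time (the host address is taken from the ring output above; repeat for each node before moving on):
# Repair this node's data, then remove data it no longer owns
nodetool -h 192.168.1.87 repair
nodetool -h 192.168.1.87 cleanup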
Hope that helps.
