Cassandra system.hints large partition

We are using Cassandra 2.1.14. Currently a large partition warning is being logged for the system.hints table.
How can we make sure that the system.hints table doesn't have wide partitions?
Note that we don't want to upgrade to Cassandra 3 right now.
Is there a periodic way to clean up system.hints?
Will this cause an I/O spike in the Cassandra cluster?
Log:
Compacting large partition system/hints:
10ad72eb-0240-4b94-b73e-eb4dc2aa759a (481568345 bytes)

How to make sure that system.hints table doesn't have wide partitions?
There's not really much you can do about that. system.hints is partitioned on target_id, which is the host ID of the target node. If 10,000 hints build up for one node, there's really no other place for them to go.
Is there a periodic way to clean up system.hints?
Hints are stored with a TTL, so they should expire after three hours. This failsafe is meant to keep the system.hints table from getting too far out of control, but it's not at all fool-proof.
One way to be sure would be to clear them out via nodetool:
nodetool truncatehints
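If the backlog is for one particular target node, truncatehints can also take that node's address as an argument so only the hints destined for it are cleared; the IP below is just a placeholder:
nodetool truncatehints 10.0.0.5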
Will this cause an I/O spike in the Cassandra cluster?
Running nodetool truncatehints is fairly innocuous. I haven't noticed a spike from running it before.

Related

How often should I run nodetool compact and repair in Cassandra?

We have a 14-node Cassandra cluster, v3.5.
Can someone enlighten me about compact & repair?
If I am running it from one node, does this need to be run from all the nodes in the cluster?
nodetool compact
I see it is very slow; how often is this supposed to be run?
Same question regarding nodetool repair (all nodes or certain nodes in the cluster):
nodetool repair or
nodetool repair -pr
How often is this supposed to be run?
Compactions are part of the normal operation of Cassandra nodes. They run automatically in the background (these are known as minor compactions) and are triggered by each table's defined compaction strategy, based on any combination of configured thresholds and compaction sub-properties. The DS201 Cassandra Foundations course at DataStax Academy covers compactions in more detail.
It is not necessary for an operator/administrator to manually kick off compactions with nodetool compact. In fact, it is not recommended to trigger user-defined compactions (otherwise known as major compactions) because they can create problems down the track like the one I explained in this post -- https://community.datastax.com/questions/6396/.
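If you just want to keep an eye on what the automatic (minor) compactions are doing, rather than triggering them yourself, nodetool can show pending and recently completed work, for example:
nodetool compactionstats
nodetool compactionhistory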
Repairs, on the other hand, are something that needs to be managed by a cluster administrator. Since Cassandra has a distributed architecture, it is necessary to run repairs to keep the copies of the data consistent between replicas (Cassandra nodes).
Repairs need to be run at least once every gc_grace_seconds (configured per table). By default, GC grace is 10 days (864,000 seconds), so most DB admins run a repair on each node once a week. The DS210 Cassandra Operations course provides a good overview of Cassandra repairs.
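As a quick illustration in cqlsh (the keyspace and table names are placeholders), you can check a table's current setting with DESCRIBE and change it with ALTER TABLE:
DESCRIBE TABLE my_ks.my_table;
ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 864000;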
Running a partitioner range repair (with the -pr flag) on a node repairs only the data that node owns, so it is necessary to run nodetool repair -pr on each node, one node at a time, until all nodes in the cluster have been repaired. Jeremiah Jordan wrote a good blog post explaining why this is necessary.
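A minimal sketch of what that looks like from a single admin host, assuming nodetool can reach each node over JMX (the hostnames are placeholders for your own nodes):
# run a primary-range repair on one node at a time
for host in node1 node2 node3; do
    nodetool -h "$host" repair -pr
done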
If you're interested, datastax.com/dev has free resources for learning Cassandra. The Cassandra Fundamentals series in particular is a good place to start. It is a collection of short online tutorials where you can quickly learn the basic concepts for free. Cheers!
If I am running it from one node, does this need to be run from all the nodes in the cluster? nodetool compact: I see it is very slow; how often is this supposed to be run?
Generally, you should not run the nodetool compact command. Compactions are meant to run automatically behind the scenes unless they have been disabled. Running compaction manually may create more problems and should be avoided in most cases. The automatic compactions that run behind the scenes should be able to handle your workload. If you feel your compactions are slow, you can tune them via the compaction-related parameters in cassandra.yaml (mostly concurrent_compactors and compaction_throughput_mb_per_sec), as sketched below.
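As a rough sketch only (the numbers are illustrative, not recommendations), those settings live in cassandra.yaml:
concurrent_compactors: 4
compaction_throughput_mb_per_sec: 64
The throughput can also be changed on a live node without a restart via nodetool setcompactionthroughput 64.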
Same question regarding nodetool repair (all nodes or certain nodes in the cluster): nodetool repair or nodetool repair -pr, and how often is this supposed to be run?
Repair is a maintenance task which should be run on all the nodes once within each gc_grace_seconds period. For example, the default gc_grace_seconds is 10 days, so repair needs to be run on all the nodes once within that 10-day window. You should schedule your repairs to run regularly, once per gc_grace_seconds period. Regarding which option to use: if you are running it yourself, you should run nodetool repair -pr on all the nodes, one by one.
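A minimal sketch of scheduling that with cron on each node (stagger the day/hour per node so only one node repairs at a time; the log path is just a placeholder):
# every Sunday at 02:00, repair this node's primary ranges
0 2 * * 0 nodetool repair -pr > /var/log/cassandra/repair.log 2>&1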

Data Inconsistency in Cassandra Cluster after migration of data to a new cluster

I see some data inconsistency after moving data to a new cluster.
The old cluster has 9 nodes in total, each with 2+ TB of data.
The new cluster has the same number of nodes as the old one, and the configuration is the same.
Here is what I performed, in order:
nodetool snapshot.
Copied the snapshots to the destination.
Created a new keyspace on the destination cluster.
Used the sstableloader utility to load the data.
Restarted all nodes.
After the transfer completed successfully, I ran a few queries to compare (old vs. new cluster) and found that the new cluster is not consistent, even though the data appears to be properly distributed on each node (nodetool status).
The same query returns different result sets for some of the partitions: I get zero rows the first time, 100 rows the second time, then 200 rows, and eventually it becomes consistent for a few partitions and the record count matches the old cluster.
A few partitions have no data in the new cluster, whereas the old cluster has data for those partitions.
I tried running queries in cqlsh with CONSISTENCY ALL but the problem still exists.
Did I miss any important steps before or after the migration?
Is there any procedure to find out the root cause of this?
I am currently running nodetool repair, but I doubt it will solve this since I already tried with CONSISTENCY ALL.
I highly appreciate your help!
The fact that the results eventually become consistent indicates that the replicas are out of sync.
You can verify this by reviewing the logs around the time that you were loading data, particularly for dropped mutations. You can also check the output of nodetool netstats. If you're seeing blocking read repairs, that's another confirmation that the replicas are out of sync.
If you still have other partitions you can test, enable TRACING ON in cqlsh when you query with CONSISTENCY ALL. You will see whether there are digest mismatches in the trace output, which should also trigger read repairs. Cheers!
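For example, in a cqlsh session (the keyspace, table, and key below are placeholders for one of your own partitions):
CONSISTENCY ALL;
TRACING ON;
SELECT * FROM my_ks.my_table WHERE id = 1234;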
[EDIT] Based on your comments below, it sounds like you possibly did not load the snapshots from ALL the nodes in the source cluster with sstableloader. If you've missed loading SSTables to the target cluster, then that would explain why data is missing.
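If you do need to re-run the load, a rough sketch per source node looks like the following; the hosts and path are placeholders, and sstableloader expects the directory path to end in <keyspace>/<table>, so copy the snapshot files into such a directory first:
sstableloader -d 10.0.1.10,10.0.1.11 /tmp/load/my_ks/my_table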

What does Cassandra nodetool repair exactly do?

From http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html I know that
The nodetool repair command repairs inconsistencies across all of the replicas for a given range of data.
But how does it fix the inconsistencies? It's written that it uses Merkle trees, but those are for comparison, not for fixing 'broken' data.
How can the data become 'broken'? Are there any common causes besides hard drive failure?
Side question: it's compaction that evicts tombstones, right? So is the requirement to run nodetool repair more frequently than gc_grace_seconds only there to ensure that all data is spread to the appropriate replicas? Shouldn't that be the usual scenario anyway?
The data can become inconsistent whenever a write to a replica is not completed for whatever reason. This can happen if a node is down, if the node is up but the network connection is down, if a queue fills up and the write is dropped, if there is a disk failure, and so on.
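If you suspect dropped writes, each node's thread pool statistics include counters of dropped messages, which you can check with:
nodetool tpstats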
When inconsistent data is detected by comparing the merkle trees, the bad sections of data are repaired by streaming them from the nodes with the newer data. Streaming is a basic mechanism in Cassandra and is also used for bootstrapping empty nodes into the cluster.
The reason you need to run repair within gc_grace_seconds is so that tombstones will be synced to all nodes. If a node is missing a tombstone, it won't drop that data during compaction. The nodes with the tombstone will drop the data during compaction, and then when they later run repair, the deleted data can be resurrected from the node that was missing the tombstone.

Cassandra nodetool repair: what does it really do?

Hi all (cassandra noob here),
I'm trying to understand exactly what's going on with repair, and to get to the point where we can run our repairs on a schedule.
I have set up a 4-DC (3 nodes per DC) Cassandra 2.1 cluster, 4 GB RAM (1 GB heap), HDDs.
Due to various issues (repair taking too long / crashing nodes with OOM), I decided to start fresh and nuke everything: I deleted /var/lib/cassandra/data and /opt/cassandra/data.
I recreated my keyspaces (no data) and ran nodetool repair -par -inc -local.
I was surprised to see it took ~5 minutes to run; watching the logs, I saw Merkle trees being generated.
I guess my question is: if my keyspaces have no data, what is it generating Merkle trees against?
After running a local-DC repair against each node in that DC, I decided to run a cross-DC repair, again with no data.
This time it took 4+ hours, with no data? This really feels wrong to me; what am I missing?
Thanks

Cassandra nodetool repair options

I have a 15-node cluster with RF 3 (using vnodes). We are ingesting data into the 15 nodes from multiple clients. It turns out that one of the nodes has been down for a couple of days and is now almost 200 GB behind; the other nodes have approximately 380 GB each.
What sort of nodetool repair would you recommend here? I know that the nodetool repair operation is CPU-intensive and this might affect the rate at which the clients can ingest into the cluster. There seem to be several nodetool repair options, such as -snapshot, -par, etc., and I was wondering whether any of them would better suit my current scenario.
I'm trying to run the repair with the least performance hit possible on the cluster.
Thanks,
mskh
Unless you have already taken a snapshot to repair from, the -snapshot option won't do you any good.
Do you have multiple datacenters? If so, you could do a nodetool repair -local, which would only repair your node from nodes in its local datacenter. This is a good way to repair a node without affecting overall cluster performance.
Otherwise Rock's suggestion of repairing only the first partition range (in parallel) is worth trying, as well.
You can use nodetool repair -par on each node to minimize the impact on the online cluster.
Run nodetool cleanup once the repair is done.
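As a rough sketch of the lower-impact options mentioned in these answers (run them against the node that fell behind; flags as in 2.x/3.x nodetool):
# repair using only replicas in this node's local datacenter
nodetool repair -local
# repair only this node's primary token ranges
nodetool repair -pr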
