How to track data movement in YugabyteDB in multi-region replication

[Question posted by a user on YugabyteDB Community Slack]
When I change my placement info by adding a new datacenter, my data should start moving to satisfy those placement "rules". How can I track that movement?
Should get_load_move_completion give me some info?

The get_load_move_completion command in yb-admin only tracks load movement when a node is decommissioned (blacklisted). Essentially, it returns, as a percentage, 1 - (count of replicas still present on the blacklisted nodes) / (initial replica count on the blacklisted nodes).
The http://<yb-master-ip>:7000/tasks endpoint in the master admin UI is where you can see all the replica additions and removals.
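As a rough sketch (placeholder addresses; the master RPC port defaults to 7100 and the admin UI to 7000), you could check both from the command line:
# percentage of load moved off blacklisted nodes
yb-admin -master_addresses <master-1>:7100,<master-2>:7100,<master-3>:7100 get_load_move_completion
# the same tasks page from the master admin UI, fetched directly
curl -s http://<yb-master-ip>:7000/tasks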

Related

How to remove ghost peers from a failed TiDB scale-in operation

I scaled in a TiDB cluster a few weeks ago to remove a misbehaving TiKV peer.
The peer refused to tombstone even after a full week, so I turned the server itself off, left it for a few days to see if there were any issues, and then ran a forced scale-in to remove it from the cluster.
Even though tiup cluster display {clustername} no longer shows that server, some of the other TiKV servers keep trying to contact it.
Example log entries:
[2022/10/13 14:14:58.834 +00:00] [ERROR] [raft_client.rs:840] ["connection abort"] [addr=1.2.3.4:20160] [store_id=16025]
[2022/10/13 14:15:01.843 +00:00] [ERROR] [raft_client.rs:567] ["connection aborted"] [addr=1.2.3.4:20160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error=Some(RemoteStopped)] [store_id=16025]
(IP replaced with 1.2.3.4, but the rest is verbatim)
The server in question has been removed from the cluster for about a month now, and yet the TiKV nodes still think it's there.
How do I correct this?
The store_id might be a clue: I believe there is a Raft store where the removed server was a leader, but how do I force that store to choose a new leader? The documentation is not clear on this, but I believe the solution has something to do with the PD servers.
Could you first check the store id in pd-ctl to ensure it's in tombstone?
For pd-ctl usage, please refer to https://docs.pingcap.com/tidb/dev/pd-control.
You can use pd-ctl to delete the store, and once it is tombstone, use pd-ctl remove-tombstone to remove it completely.
For any region in TiKV, if its leader is disconnected, the followers will re-elect a leader, so the dead TiKV node won't remain the leader of any region.
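As a rough sketch (placeholder PD address; the store ID is taken from your log lines), the sequence would look something like:
# check the current state of the store that the TiKV nodes keep contacting
pd-ctl -u http://<pd-address>:2379 store 16025
# if it is still registered, ask PD to delete it (it will eventually become tombstone)
pd-ctl -u http://<pd-address>:2379 store delete 16025
# once the state shows Tombstone, purge it completely
pd-ctl -u http://<pd-address>:2379 store remove-tombstone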

Checking disk size consumed by each database in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
Is there any way we can check the disk size consumed by each database and table? I tried the queries below but no luck:
- \l+
- select pg_database_size('databaseName');
- select t1.datname AS db_name,
         pg_size_pretty(pg_database_size(t1.datname)) as db_size
  from pg_database t1
  order by pg_database_size(t1.datname) desc;
YugabyteDB uses its own storage layer, DocDB (built on RocksDB), to store table data, so these PostgreSQL catalog views do not have any sizing details. You can see per-table details directly in the YugabyteDB Tablet Server UI at http://<ipaddress>:9000/tables, which shows an on-disk size column for each table.
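If you prefer the command line, the same page can be pulled with curl (the address is a placeholder):
# fetch the Tablet Server tables page, which includes the on-disk size column
curl -s http://<yb-tserver-ip>:9000/tables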

Query not being redirected into read replica cluster in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I think I have still misunderstood something about read replicas. It seems that if I run a query that involves a sequential scan on a read-only replica, the actual read is done on the main cluster, even though the read replica seems to have the whole dataset.
When I ran a simple select count(*) query on the read replica, I expected it to do a local read from its own data.
However, as can be seen from the picture, the main nodes actually did the reads while the read replica waited in an almost idle state until it got the response from the main node. Where did I go wrong? (using YugabyteDB 2.6)
Note that reading from followers is only available starting in 2.11, which was recently released: https://blog.yugabyte.com/announcing-yugabytedb-2-11/
Another thing to remember: even on 2.11, the default behavior, no matter which node you connect to, is to redirect the read to the corresponding leader tablets. You have to enable reading from followers in the current session, like below:
SET yb_read_from_followers = true;
START TRANSACTION READ ONLY;
SELECT * from t WHERE k='k1'; --> follower read
 k  | v
----+----
 k1 | v1
(1 row)
COMMIT;
This lets YugabyteDB know that it is OK to read from a follower tablet (such as one in a read-replica cluster).
Also, even in the read-replica cluster, the node receiving the request might not have the data. For example, you could have a 5-node read-replica cluster with RF=2 on that cluster, so the node that initially receives the request might not hold the data the request is interested in. Where the request is routed depends on the session/statement setting: by default, reads go to the leader tablets, but if read-from-followers is enabled, the request is routed to a follower tablet in the same region.
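Putting this together for the count(*) case from the question, a minimal session sketch (host and database names are placeholders):
# connect to a read-replica node and enable follower reads for this session
ysqlsh -h <read-replica-node-ip> -d yugabyte <<'SQL'
SET yb_read_from_followers = true;
START TRANSACTION READ ONLY;
SELECT count(*) FROM t;  -- can now be served by follower tablets in the local region
COMMIT;
SQL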

Cassandra read consistency is one, but node connects to another node

A 3-node cluster with an RF of 3 means every node has all the data, and consistency is ONE.
So when I query for some data on node-1, ideally node-1 should be able to complete my query on its own, since it has all the data.
But when I checked how my query runs using 'tracing on', it shows that node-1 also contacts node-2, which should not be needed as per my understanding.
Am I missing something here?
Thanks in advance.
Edited:
Added the output of 'tracing on'.
It can be seen in the image that node 10.101.201.3 has contacted 10.101.201.4.
3 node cluster and RF of 3 means every node has all the data.
Just because every node has 100% of the data does not mean that every node owns 100% of the token ranges. The token ranges in a 3-node cluster will be split up evenly, roughly 33% each.
In short, node-1 may have all of the data, but it is only primarily responsible for about 33% of it. When the partition key is hashed, the query is likely being directed toward node-2 because node-2 is primarily responsible for that partition key, despite the fact that the other nodes contain secondary and tertiary replicas.
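If you want to confirm which nodes are responsible for a given key, nodetool can show both the owning replicas and the token ring (keyspace, table, and key below are placeholders):
# list the replicas that own a specific partition key
nodetool getendpoints <keyspace> <table> '<partition-key>'
# show how the token ranges are distributed across the ring
nodetool ring <keyspace>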
I'm using cqlsh. Does this change if I'm running the query from application code?
Yes, because the load balancing policy configured in the application code can also affect this behavior.

Cassandra LOCAL_QUORUM is waiting for remote datacenter responses

We have a cluster with 2 datacenters (one in the EU and one in the US), with 4 nodes each, deployed in AWS.
The nodes in each datacenter are spread across 3 racks (availability zones).
In the cluster we have a keyspace test with replication: NetworkTopologyStrategy, eu-west: 3, us-east: 3.
In the keyspace we have a table called mytable that has only one column, id text.
Now, we were doing some tests on the performance of the database.
In cqlsh, with a consistency level of LOCAL_QUORUM, we were doing some inserts with TRACING ON, and we noticed that the requests were not behaving as we expected.
From the tracing data we found that the coordinator node was hitting 2 other local nodes, as expected, but was also sending a request to one of the remote datacenter nodes. The problem is that the coordinator was waiting not only for the local nodes (which finished in no time) but for the remote nodes too.
Since our 2 datacenters are geographically far away from each other, our requests were taking a very long time to complete.
Notes:
- This does not happen with DSE, but our understanding was that we shouldn't need to pay crazy money for LOCAL_QUORUM to work as expected.
There is a high probability that you're hitting CASSANDRA-9753, where a non-zero dclocal_read_repair_chance triggers a query against the remote DC. Check the trace for a hint that read repair was triggered for your query. If it was, you can set dclocal_read_repair_chance to 0; this parameter is deprecated anyway...
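If the trace does confirm read repair, a minimal sketch of disabling it for the table from the question (node address is a placeholder; the option exists in Cassandra 3.x and was removed in 4.0):
# set the deprecated cross-DC read repair chance to zero on test.mytable
cqlsh <node-ip> -e "ALTER TABLE test.mytable WITH dclocal_read_repair_chance = 0.0;"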
For functional and performance tests it would be better to use the driver instead of cqlsh, as most of the time that will be the way you interact with the database.
For this case, you may use a DC-aware policy like this:
// DataStax Java driver 3.x; imports added for completeness
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withLoadBalancingPolicy(
        DCAwareRoundRobinPolicy.builder()
            .withLocalDc("myLocalDC")   // only nodes in this DC are used for routing
            .build()
    ).build();
This is a modified version of the documented driver example, with the clauses that allow interaction with remote datacenters removed, since your goal is to keep the calls local.
