Reduce a 2 x 2 gluster volume to a 1 x 2 volume

I have a simple gluster setup where 4 servers each have 1 brick.
I'd like to take two servers out of action and simply have 2 servers with replicated data.
I've tried
gluster volume remove-brick gv0 machine1:/export/brick1 machine2:/export/brick1
however I get the error
volume remove-brick commit force: failed: Bricks not from same subvol for replica
How do I go about this?
FYI
gluster volume info gv0
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 75a37568-67e7-4bf9-8b74-fabfa8487e97
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: machine3:/export/brick1
Brick2: machine2:/export/brick1
Brick3: machine1:/export/brick1
Brick4: machine4:/export/brick1
Thanks

When removing or adding a gluster brick you must pass the correct replica count in the remove-brick/add-brick command: when adding a new brick give the replica number as N+1, where N is the number of existing bricks, and when removing a brick the replica number will be N-1. Then it will work.
Here we have 4 bricks and we are going to remove 2, so the new replica number will be 4-2=2. Add the 'force' option at the end.
gluster volume remove-brick gv0 replica 2 machine1:/export/brick1 machine2:/export/brick1 force
or as two separate commands:
gluster volume remove-brick gv0 replica 3 machine1:/export/brick1 force
gluster volume remove-brick gv0 replica 2 machine2:/export/brick1 force
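As a quick sketch (not part of the original answer), you can verify the result afterwards with standard gluster commands; gv0 is the volume from the question:
# The volume should now report Number of Bricks: 1 x 2 = 2
gluster volume info gv0
gluster volume status gv0
# Nothing should be left pending heal on the remaining replica pair
gluster volume heal gv0 info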

Can't you simply take the servers out of the pool with:
gluster peer detach machine1
Late reply, I know, and you may have already figured it out.
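If you go that route once the bricks have been removed from the volume, a minimal sketch (my addition) would be to detach both freed servers and confirm the pool:
gluster peer detach machine1
gluster peer detach machine2
# The remaining peers should show as connected
gluster peer status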

Related

Nodetool load and owns stats

We are running 2 nodes in a cluster - replication factor 1.
After writing a burst of data, we see the following via nodetool status:
Node 1 - load 22G (owns 48.2%)
Node 2 - load 17G (owns 51.8%)
As the payload size per record is exactly equal - what could lead to a node showing higher load despite lower ownership?
Nodetool status uses the Owns column to indicate the effective percentage of the token range owned by each node, while Load is the on-disk size of your data.
I don't see anything wrong here. Your data is almost evenly distributed across your two nodes, which is exactly what you want for good performance.
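If you want to dig further into the gap, a small sketch of standard checks (my addition; my_keyspace is a placeholder) would be:
# Ownership is reported per keyspace, so pass the keyspace explicitly
nodetool status my_keyspace
# Per-table on-disk sizes, to see which tables account for the extra load
nodetool cfstats
# Pending compactions can temporarily inflate the reported load
nodetool compactionstats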

Can I change GlusterFS replica 3 to replica 3 with arbiter 1?

I configured a volume as a replica 3 and now I want to convert it to a replica 3 with arbiter 1.
I cannot seem to locate any information on whether this is possible or whether I need to move my data, destroy the volume and recreate it.
I am running glusterfs 4.1.4
Volume Name: clustered_sites
Type: Replicate
Volume ID: 34ef4f5b-497a-443c-922b-b168729ac1c6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: www-one-internal:/mnt/clustered_sites/brick1
Brick2: www-two-internal:/mnt/clustered_sites/brick1
Brick3: www-three-internal:/mnt/clustered_sites/brick1
Options Reconfigured:
cluster.consistent-metadata: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
Yes, you can. Ensure that there are no pending heals before you do so.
gluster volume remove-brick clustered_sites replica 2 www-three-internal:/mnt/clustered_sites/brick1 force
gluster volume add-brick clustered_sites replica 3 arbiter 1 www-three-internal:/mnt/clustered_sites/new_arbiter_brick
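A small sketch of the heal checks around those two commands (my addition, not part of the answer):
# Before: make sure nothing is pending heal
gluster volume heal clustered_sites info
# ...run the remove-brick and add-brick commands above...
# After: the new arbiter brick gets populated by self-heal; watch the list drain
gluster volume heal clustered_sites info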

High disk I/O on Cassandra nodes

Setup:
We have a 3-node Cassandra cluster with around 850G of data on each node. The Cassandra data directory is on LVM (currently spanning 3 drives: 800G + 100G + 100G) and there is a separate non-LVM volume for cassandra_logs.
Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1
Issue:
After adding the 3rd (100G) volume to the LVM on each node, all the nodes show very high disk I/O and go down quite often. The servers also become inaccessible and need to be rebooted; they don't stay stable and we have to reboot every 10-15 minutes.
Other Info:
We have the DSE-recommended server settings (vm.max_map_count, file descriptors) configured on all nodes
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:
Pool Name       Active   Pending   Completed   Blocked   All time blocked
FlushWriter          0         0          22         0                  8
FlushWriter          0         0          80         0                  6
FlushWriter          0         0          38         0                  9
The column I'm concerned about is the All Time Blocked. As a ratio to completed, you have a lot of blocking. The flushwriter is responsible for flushing memtables to the disk to keep the JVM from running out of memory or creating massive GC problems. The memtable is an in-memory representation of your tables. As your nodes take more writes, they start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.
Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge sorted results. More sequential IO.
So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over on sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium term is to find some ways to optimize your stack for sequential disk loads, but that will eventually fail. Long term is get your data on real disks and off shared storage.
For years I have told consulting clients using Cassandra: "If your storage has an ethernet plug, you are doing it wrong." It is a good rule of thumb.
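To watch this happening on your own nodes, a rough sketch of the checks behind this answer (standard tools, not commands from the original post):
# Per-device throughput, %util and %iowait, refreshed every 5 seconds
iostat -xm 5
# Blocked FlushWriters mean memtable flushes are waiting on the disk
nodetool tpstats | grep -i flush
# Long-running compactions add more sequential I/O on the same disks
nodetool compactionstats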

ERROR 1777 (HY000): Partition memsqldb:0 has no master instance

I am using the community edition of MemSQL. I got this error while I was running a query today. I just restarted my cluster and that resolved the error.
memsql-ops cluster-restart
But what happened, and what should I do in the future to avoid this error?
NOTE
I do not want to buy the Enterprise edition.
Question
Is this a problem of availability?
I got this error when experimenting with performance.
VM had 24 CPUs and 25 nodes: 1 Master Agg, 24 Leaf nodes
Reduced VM to 4 CPUs and restarted cluster.
Not all the leaves recovered: all except 4 came back in under 5 minutes, but 20 minutes later those 4 leaf nodes were still not connected.
From MySQL/MemSQL prompt:
use db;
show partitions;
I noticed that some partitions (ordinals 0-71 in my case) had NULL instead of a Host, Port and Role defined for the partition.
In the MemSQL Ops UI (http://server:9000 > Settings > Config > Manual Cluster Control) I checked "ENABLE MANUAL CONTROL" and tried running various commands, with no real benefit.
Then, 15 minutes later, I unchecked the box; MemSQL Ops tried attaching all the leaf nodes again and was finally successful.
Perhaps a cluster restart would have done the same thing.
This happened because a leaf in your cluster has failed a health check heartbeat for some reason (loss of network connectivity, hardware failure, OS issue, machine overloaded, out of memory, etc.) and its partitions are no longer accessible to query. MemSQL Community Edition only supports redundancy 1 so there are no other copies of the data on the failed leaf node in your cluster (thus the error about missing a partition of data - MemSQL can't complete a query that needs to read data on any partitions on the problem leaf).
Given that a restart repaired things, the most likely answer is that the Linux "out of memory" killer got you: see the MemSQL Linux OOM killer docs
You can also check the tracelog on the leaf that ran into issues to see if there is any clue there about what happened (It's usually at /var/lib/memsql/leaf_3306/tracelogs/memsql.log)
-Adam
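As an added sketch (not from the answer above), you could check for the OOM killer and look at that tracelog like this; the grep patterns are just examples:
# Did the kernel OOM killer take out the memsqld process?
dmesg | grep -iE 'out of memory|killed process'
# Look at the tail of the leaf's tracelog for clues
tail -n 200 /var/lib/memsql/leaf_3306/tracelogs/memsql.log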
I too have faced this error; in my case it was because some of the slave partitions had no corresponding masters. My error message looked like:
ERROR 1772 (HY000) at line 1: Leaf Error (10.0.0.112:3306): Partition database `<db_name>_0` can't be promoted to master because it is provisioning replication
My memsql> SHOW PARTITIONS; output listed several partitions whose role was either Slave or NULL.
So the approach I followed was to drop each of those partitions.
DROP PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
DROP PARTITION <db_name>:46 ON "10.0.0.193":3306;
And then recreated each of the dropped partitions.
CREATE PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
CREATE PARTITION <db_name>:46 ON "10.0.0.193":3306;
Running memsql> SHOW PARTITIONS; again after that confirmed the fix.
You can refer to the MemSQL documentation regarding partitions if the above steps don't seem to solve your problem.
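When many partitions are affected, the same DROP/CREATE sequence can be scripted; a sketch, assuming the mysql client can reach the master aggregator on 127.0.0.1:3306 and that 4, 17 and 46 stand in for whichever ordinals had no master:
# Placeholder ordinals; take the real list from SHOW PARTITIONS
for p in 4 17 46; do
  mysql -h 127.0.0.1 -P 3306 -u root -e "DROP PARTITION <db_name>:$p ON \"10.0.0.193\":3306"
  mysql -h 127.0.0.1 -P 3306 -u root -e "CREATE PARTITION <db_name>:$p ON \"10.0.0.193\":3306"
done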
I was hitting the same problem. Running the following command on the master node solved it:
REBALANCE PARTITIONS ON db_name
Optionally you can force it using FORCE:
REBALANCE PARTITIONS ON db_name FORCE
And to see the list of operations the rebalance is going to execute, use the above command with EXPLAIN:
EXPLAIN REBALANCE PARTITIONS ON db_name [FORCE]
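Putting that together as a sketch, run from the master aggregator via the mysql client (host, port and credentials are placeholders):
# Preview what the rebalance would do before running it
mysql -h 127.0.0.1 -P 3306 -u root -e "EXPLAIN REBALANCE PARTITIONS ON db_name"
mysql -h 127.0.0.1 -P 3306 -u root -e "REBALANCE PARTITIONS ON db_name"
# Confirm every partition has a master again
mysql -h 127.0.0.1 -P 3306 -u root -e "USE db_name; SHOW PARTITIONS"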

Is it normal to get a lot of heal-failed entries in a gluster mount?

I run:
gluster volume heal myvol info heal-failed
and I get back a whole bunch of entries. Is this normal? Is anyone else out there seeing this in their implementation of glusterfs? If so, how do you go about resolving this?
The list of entries from "gluster volume heal myvol info heal-failed" can be real failures, or it can simply list the entries that the self-heal daemon failed to heal in that crawl.
Gradually, the files/directories listed under "heal-failed" will be self-healed by the self-heal daemon.
It is normal to see heal-failed entries.
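A small sketch of commands for keeping an eye on it (only the heal-failed command comes from the question; the others are standard gluster heal commands):
# Entries still pending heal; this list should shrink over time
gluster volume heal myvol info
# Kick off a full self-heal if entries look stuck
gluster volume heal myvol full
# Then re-check the heal-failed list
gluster volume heal myvol info heal-failed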
