I configured a volume as a replica 3 and now I want to convert it to a replica 3 with arbiter 1.
I cannot seem to locate any information on whether this is possible or whether I need to move my data, destroy the volume, and recreate it.
I am running glusterfs 4.1.4
Volume Name: clustered_sites
Type: Replicate
Volume ID: 34ef4f5b-497a-443c-922b-b168729ac1c6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: www-one-internal:/mnt/clustered_sites/brick1
Brick2: www-two-internal:/mnt/clustered_sites/brick1
Brick3: www-three-internal:/mnt/clustered_sites/brick1
Options Reconfigured:
cluster.consistent-metadata: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
Yes, you can. Ensure that there are no pending heals before you do so.
gluster volume remove-brick clustered_sites replica 2 www-three-internal:/mnt/clustered_sites/brick1 force
gluster volume add-brick clustered_sites replica 3 arbiter 1 www-three-internal:/mnt/clustered_sites/new_arbiter_brick
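To check for pending heals before starting, and to confirm the layout afterwards, you can run:
gluster volume heal clustered_sites info
gluster volume info clustered_sites
If the conversion worked, volume info should now report the brick count as 1 x (2 + 1) = 3, with the new brick flagged as the arbiter.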
Related
I am building a Service and I want to deploy it to my local 1-node cluster. As soon as I publish my Service to the local cluster, the cluster fails with the following errors:
Partitions - Error - Unhealthy partitions: 100% (1/1), MaxPercentUnhealthyPartitionsPerService=0%.
Partition - Error - Unhealthy partition: PartitionId='...', AggregatedHealthState='Error'.
Event - Error - Error event: SourceId='System.FM', Property='State'. Partition is below target replica or instance count. fabric:/MyFabric/MyFabricService 1 1 ...
(Showing 0 out of 0 replicas. Total up replicas: 0.)
The strange thing is this: I set a breakpoint at the very beginning of my Main method, and it never triggers, i.e. the Main method of my Service is never called.
I read that comparable issues might show up when you have too little memory or disk space available. The system I am running on has 10 GB RAM and 80 GB of free disk space, so I don't think this can be the issue here.
Any idea what might be wrong? Ideas how to fix it?
Setup:
We have a 3-node Cassandra cluster with around 850G of data on each node. The Cassandra data directory is on an LVM volume (currently consisting of 3 drives: 800G + 100G + 100G), and there is a separate (non-LVM) volume for cassandra_logs.
Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1
Issue:
After adding the 3rd (100G) volume to the LVM on each node, disk I/O went very high on all the nodes and they go down quite often. The servers also become inaccessible and we need to reboot them; they don't stay stable and we have to reboot every 10 - 15 minutes.
Other Info:
We have the DSE-recommended server settings (vm.max_map_count, file descriptor limits) configured on all nodes
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:
Pool Name Active Pending Completed Blocked All time blocked
FlushWriter 0 0 22 0 8
FlushWriter 0 0 80 0 6
FlushWriter 0 0 38 0 9
The column I'm concerned about is the All Time Blocked. As a ratio to completed, you have a lot of blocking. The flushwriter is responsible for flushing memtables to the disk to keep the JVM from running out of memory or creating massive GC problems. The memtable is an in-memory representation of your tables. As your nodes take more writes, they start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
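To put numbers on that, the blocked-to-completed ratios from the rows above are roughly 8/22 ≈ 36%, 6/80 ≈ 8% and 9/38 ≈ 24%; on a disk that is keeping up with the write load, the FlushWriter All time blocked count should stay at or very close to zero.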
When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.
Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge sorted results. More sequential IO.
So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
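If you want to watch this happen, run something like iostat -x 5 (from the sysstat package) on one of the affected nodes while a flush or compaction is in progress: sustained high await and %util on the data-directory device, together with a climbing iowait ("wa") figure in top, points at the disk rather than at Cassandra itself.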
You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over on sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium term is to find some ways to optimize your stack for sequential disk loads, but that will eventually fail. Long term is get your data on real disks and off shared storage.
I have told consulting clients this for years: when using Cassandra, "if your storage has an Ethernet plug, you are doing it wrong." It's a good rule of thumb.
As part of the upgrade of DataStax Enterprise from 4.6.1 to 4.7, I am upgrading Cassandra from 2.0.12.200 to 2.1.5.469. The overall DB size is about 27GB.
One step in the upgrade process is running nodetool upgradesstables on a seed node. However this command has been running for more than 48 hours now and looking at CPU and IO, this node seems to be effectively idle.
This is the current output from nodetool compactionstats. The numbers don't seem to change over the course of a few hours:
pending tasks: 28
compaction type keyspace table completed total unit progress
Upgrade sstables prod_2_0_7_v1 image_event_index 8731948551 17452496604 bytes 50.03%
Upgrade sstables prod_2_0_7_v1 image_event_index 9121634294 18710427035 bytes 48.75%
Active compaction remaining time : 0h00m00s
And this is the output from nodetool netstats:
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed
Commands n/a 0 0
Responses n/a 0 0
My question is, is this normal behavior? Is there a way for me to check if this command is stuck or if it is actually still doing work?
If it is stuck, what's the best course of action?
For what it's worth, this node is an m3.large instance on Amazon EC2 with all the data on the local ephemeral disk (SSD).
I have a simple gluster setup where 4 servers each have 1 brick.
I'd like to take two servers out of action and simply have 2 servers with replicated data.
I've tried
gluster volume remove-brick gv0 machine1:/export/brick1 machine2:/export/brick1
however I get the error
volume remove-brick commit force: failed: Bricks not from same subvol for replica
How do I go about this?
FYI
gluster volume info gv0
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 75a37568-67e7-4bf9-8b74-fabfa8487e97
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: machine3:/export/brick1
Brick2: machine2:/export/brick1
Brick3: machine1:/export/brick1
Brick4: machine4:/export/brick1
Thanks
While removing or adding a gluster brick you should provide the correct replica number in the remove-brick/add-brick command: when adding a new brick, give the replica number as N+1 (where N is the number of existing bricks), and when removing one, the replica number will be N-1. Then it'll work.
Here we have 4 bricks and we are going to remove 2 of them, so the new replica number will be 4-2=2. Provide the 'force' option at the end.
gluster volume remove-brick gv0 replica 2 machine1:/export/brick1 machine2:/export/brick1 force
Or as two separate commands:
gluster volume remove-brick gv0 replica 3 machine1:/export/brick1 force
gluster volume remove-brick gv0 replica 2 machine2:/export/brick1 force
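Either way, you can confirm the result afterwards with gluster volume info gv0; once the removal has gone through, the type should be back to plain Replicate and the brick count should read 1 x 2 = 2.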
Can't you simply take the servers out of the pool with:
gluster peer detach machine1
Late reply, I know, and you may have already figured it out?
I was testing rotating through a 4-node cluster, adding and removing nodes in a cyclic manner so that the members of the cluster adhered to the following repeating sequence:
1 2 3
2 3
2 3 4
3 4
1 3 4
1 4
1 2 4
1 2
1 2 3
2 3
2 3 4
3 4
1 3 4
1 4
...
Node addition was performed by stopping cassandra, wiping /var/lib/cassandra/*, and restarting cassandra (with the same cassandra.yaml file, which listed nodes 1 and 2 as seeds). Node removal was performed by stopping cassandra and then issuing nodetool removenode $nodeId from another node. In all cases, the next operation was not started until the previous one was completed.
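As a rough sketch of those two procedures in shell form (assuming a service-managed install and the default /var/lib/cassandra paths, as in my setup):
# adding a node back: start it from an empty data directory
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/*
sudo service cassandra start    # same cassandra.yaml, seeds = node 1 and node 2
# removing a node: stop it, then drop it from a surviving node
sudo service cassandra stop     # on the node being removed
nodetool removenode $nodeId     # on another node; $nodeId is the Host ID shown by nodetool status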
The above sequence of cluster members repeated several times until, after about 4 iterations, I was performing an "add node" operation to transition from a cluster of nodes {1, 2} to a cluster of nodes {1, 2, 3}. On this iteration, my custom keyspace failed to propagate to node 3. Nodetool status looked fine:
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.12.206 164.88 KB 256 66.2% 7018ef8a-af08-40e9-b3d3-065f4ba6eb0d rack1
UN 192.168.12.207 60.85 KB 256 63.2% ff18b636-6287-4c70-bf23-0a1a1814b864 rack1
UN 192.168.12.205 217.19 KB 256 70.6% 2bc38fa8-42a1-457f-84d7-35b3b46e1daa rack1
But cqlsh on node 3 didn't know about my keyspace. I tried to run nodetool repair, which seemed to loop infinitely while spewing the following couple of stack traces into the log:
WARN [Thread-9781] 2014-09-16 19:34:30,081 IncomingTcpConnection.java (line 83) UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=08768b1d-97a1-3528-8191-9acee7b08ef4
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:178)
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:103)
at org.apache.cassandra.service.paxos.Commit$CommitSerializer.deserialize(Commit.java:145)
at org.apache.cassandra.service.paxos.Commit$CommitSerializer.deserialize(Commit.java:134)
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:153)
at org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:130)
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:74)
ERROR [Thread-9782] 2014-09-16 19:34:31,484 CassandraDaemon.java (line 199) Exception in thread Thread[Thread-9782,5,main]
java.lang.NullPointerException
at org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:247)
at org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:156)
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:153)
at org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:130)
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:74)
Any ideas what's going on and how to fix this (ideally, a reliable working repair and a way to avoid entering this state in the first place)?
If there is a schema version disagreement, you can tell by running nodetool describecluster.
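In the output of that command, a healthy cluster shows a single entry under Schema versions with every node's address listed against it; a node that has drifted shows up under a second version UUID.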
If you are seeing different versions, do the following on the node that has the wrong version:
Stop the Cassandra service/process, typically by running nodetool drain and then sudo service cassandra stop (or kill <pid>).
At the end of this process the commit log directory (/var/lib/cassandra/commitlog) should contain only a single small file.
Remove the Schema* and Migration* sstables inside of your system keyspace (/var/lib/cassandra/data/system, if you're using the defaults).
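As a sketch of those steps (assuming the default paths and a service-managed install; the exact file names under the system keyspace differ between Cassandra versions):
nodetool drain                  # flush memtables and stop accepting writes
sudo service cassandra stop     # or kill <pid>
sudo rm -rf /var/lib/cassandra/data/system/Schema*
sudo rm -rf /var/lib/cassandra/data/system/Migration*
sudo service cassandra start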
After starting Cassandra again, this node will notice the missing information and pull in the correct schema from one of the other nodes. In version 1.0.X and before, the schema is applied one mutation at a time. While it is being applied, the node may log messages, such as the one below, saying that a column family cannot be found. These messages can be ignored.
ERROR [MutationStage:1] 2012-05-18 16:23:15,664 RowMutationVerbHandler.java (line 61) Error in row mutation
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1012
To confirm everything is on the same schema, verify that 'describe cluster;' only returns one schema version.
Source: https://wiki.apache.org/cassandra/FAQ