I have a 5-node Cassandra cluster, and each node currently owns 1 TB of data. When I tried adding another node, it took 15+ hours to bring it to the 'UN' state.
Is there a way to make it fast?
Cassandra version: 3.0.13
Environment: AWS, m4.2xlarge machines.
1 TB is a lot of data per node. Since you have a 5-node cluster and you're adding a new node, that node will take roughly 0.833 TB of data that has to be streamed from the existing nodes. That is the equivalent of about 6.67 Tbit, or 6,990,507 Mbit. Cassandra's default value for stream_throughput_outbound_megabits_per_sec is 200, and 6,990,507 ÷ 200 ≈ 34,952.5 seconds ≈ 9.7 hours to transfer all the data. Since you're probably running other traffic at the same time, this could very well take 15 hours.
Solution: Change the stream_throughput_outbound_megabits_per_sec on all nodes to a higher value.
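For example, the throttle can be raised at runtime with nodetool on each node (a sketch; the 400 Mbit/s value is just an illustration, not a tuned recommendation):
nodetool setstreamthroughput 400    # takes effect immediately, reverts to cassandra.yaml on restart
nodetool getstreamthroughput        # verify the current setting
To make the change permanent, set stream_throughput_outbound_megabits_per_sec: 400 in cassandra.yaml and restart the node.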
Note: Don't forget to run nodetool cleanup after the node has joined the cluster.
Related
I have a Cassandra cluster with a single data center and 2 nodes. The two nodes are in two racks; the first node holds 6.6 GiB and the second 5.9 GiB. The second node often goes down, and when I restart the server it takes a long time to come back up. Can anyone help me out of this situation?
I'm trying to add a new node to our cluster (Cassandra 2.1.11, 16 nodes, 32 GB RAM, 2x3 TB HDD, 8-core CPU, 1 datacenter, 2 racks, about 700 GB of data on each node). After starting the new node, data (approx. 600 GB total) from the 16 existing nodes was successfully transferred to the new node and the building of secondary indexes started. The secondary index build process looks normal; I see info about the successful completion of some secondary index builds and some stream tasks:
INFO [StreamReceiveTask:9] 2015-11-22 02:15:23,153 StreamResultFuture.java:180 - [Stream #856adc90-8ddd-11e5-a4be-69bddd44a709] Session with /192.168.21.66 is complete
INFO [StreamReceiveTask:9] 2015-11-22 02:15:23,152 SecondaryIndexManager.java:174 - Index build of [docs.docs_ex_pl_ph_idx, docs.docs_lo_pl_ph_idx, docs.docs_author_login_idx, docs.docs_author_extid_idx, docs.docs_url_idx] complete
Currently 9 out of 16 streams have successfully finished, according to the logs. Everything looks fine, except one issue: this process has already lasted 5 full days. There are no errors in the logs, nothing suspicious, except the extremely slow progress.
nodetool compactionstats -H
shows
Secondary index build ... docs 882,4 MB 1,69 GB bytes 51,14%
So there is some progress on the index build, but it is very slow: about 1% in half an hour or so.
The only significant difference between the new node and any of the existing nodes is that the Cassandra java process has 21k open files, in contrast to ~300 open files on any existing node, and 80k files in the data directory on the new node versus 300-500 files in the data directory on any existing node.
Is this normal? At this speed it looks like I'll spend 16 weeks or so to add 16 more nodes.
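For reference, a quick way to reproduce the counts described in the question (the paths and the process name are assumptions; adjust them to your install):
lsof -p "$(pgrep -f CassandraDaemon | head -1)" | wc -l    # open files held by the Cassandra process
find /var/lib/cassandra/data -type f | wc -l               # files under the data directory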
I know this is an old question, but we ran into this exact issue with 2.1.13 using DTCS. We were able to fix it in our test environment by increasing memtable flush thresholds to 0.7 - which didn't make any sense to us, but may be worth trying.
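For what it's worth, a guess at the concrete setting: "memtable flush thresholds" presumably refers to memtable_cleanup_threshold in cassandra.yaml (that mapping is an assumption, not confirmed by the answer). The change would look like this, followed by a rolling restart:
# cassandra.yaml on each node (2.1.x); assumed to be the setting the answer means
# default is 1 / (memtable_flush_writers + 1); raising it makes flushes less aggressive
memtable_cleanup_threshold: 0.7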
I am using the community edition of MemSQL. I got this error while I was running a query today, so I just restarted my cluster and the error went away.
memsql-ops cluster-restart
But what happened, and what should I do in the future to avoid this error?
NOTE
I do not want to buy the Enterprise edition.
Question
Is this a problem of availability?
I got this error when experimenting with performance.
VM had 24 CPUs and 25 nodes: 1 Master Agg, 24 Leaf nodes
Reduced VM to 4 CPUs and restarted cluster.
Not all of the leaves recovered.
All except 4 recovered in < 5 minutes.
20 minutes later, 4 leaf nodes were still not connected.
From MySQL/MemSQL prompt:
use db;
show partitions;
I noticed that some partitions, with ordinals from 0-71 in my case, had NULL instead of Host, Port, and Role defined.
In the MemSQL Ops UI (http://server:9000 > Settings > Config > Manual Cluster Control) I checked "ENABLE MANUAL CONTROL" while I tried to run various commands, with no real benefit.
Then, 15 minutes later, I unchecked the box; MemSQL Ops tried attaching all the leaf nodes again and was finally successful.
Perhaps a cluster restart would have done the same thing.
This happened because a leaf in your cluster has failed a health check heartbeat for some reason (loss of network connectivity, hardware failure, OS issue, machine overloaded, out of memory, etc.) and its partitions are no longer accessible to query. MemSQL Community Edition only supports redundancy 1 so there are no other copies of the data on the failed leaf node in your cluster (thus the error about missing a partition of data - MemSQL can't complete a query that needs to read data on any partitions on the problem leaf).
Given that a restart repaired things, the most likely answer is that the Linux "out of memory" killer killed the leaf process: see the MemSQL Linux OOM killer docs.
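To confirm the OOM theory, a generic Linux check (log locations vary by distribution):
dmesg | grep -i "out of memory"
grep -i "killed process" /var/log/syslog    # or /var/log/messages on RHEL/CentOS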
You can also check the tracelog on the leaf that ran into issues to see if there is any clue there about what happened (It's usually at /var/lib/memsql/leaf_3306/tracelogs/memsql.log)
-Adam
I too have faced this error; in my case it was because some of the slave ordinals had no corresponding masters. My error message looked like:
ERROR 1772 (HY000) at line 1: Leaf Error (10.0.0.112:3306): Partition database `<db_name>_0` can't be promoted to master because it is provisioning replication
My memsql> SHOW PARTITIONS; command returned the following.
So the approach I followed was to drop each of those partitions (where the role was either Slave or NULL).
DROP PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
DROP PARTITION <db_name>:46 ON "10.0.0.193":3306;
And then created a new partition for each of the dropped partitions.
CREATE PARTITION <db_name>:4 ON "10.0.0.193":3306;
..
CREATE PARTITION <db_name>:46 ON "10.0.0.193":3306;
And this was the result of memsql> SHOW PARTITIONS; after that.
You can refer to the MemSQL documentation regarding partitions, here, if the above steps don't solve your problem.
I was hitting the same problem. Running the following command on the master node solved it:
REBALANCE PARTITIONS ON db_name
Optionally you can force it using FORCE:
REBALANCE PARTITIONS ON db_name FORCE
And to see the list of operations the rebalance is going to execute, use the command with EXPLAIN:
EXPLAIN REBALANCE PARTITIONS ON db_name [FORCE]
I'd been using a 6 GB dataset, with each source record ~1 KB in length, when I accidentally added an index on a column that I am pretty sure has 100% cardinality.
I tried dropping the index from cqlsh, but by that point the two-node cluster had gone into a runaway death spiral with the load average surpassing 20 on each node, and cqlsh hung on the drop command for 30 minutes. Since this was just a test setup, I shut down and destroyed the cluster and restarted.
This is a fairly disconcerting problem, as it makes me fear a scenario where a junior developer on a production cluster sets an index on a similarly high-cardinality column. I scanned through the documentation and looked at the options in nodetool, but there didn't seem to be anything along the lines of "abort job" or "abort building index".
Test environment:
2x m1.xlarge EC2 instances with 2 Raid 0 ephemeral disks
Dataset was 6GB, 1KB per record.
My question, in summary: Is it possible to abort the process of building a secondary index, and/or to stop or postpone running builds (indexing, compaction) until a later date?
nodetool -h node_address stop index_build
See: http://www.datastax.com/docs/1.2/references/nodetool#nodetool-stop
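For completeness, nodetool stop takes the operation type as an argument, so the same mechanism covers compactions as well (a sketch; note that stopping the build does not drop the index definition, so you'd still need to drop the index itself):
nodetool -h node_address stop INDEX_BUILD    # abort the in-progress secondary index build
nodetool -h node_address stop COMPACTION     # abort currently running compactions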
As part of a production problem we have, we want to upgrade our production 1.0.9 cluster to 1.2.1/2.
While testing the upgrade in our dev environment, I'm getting the following exception while doing a simple list operation using the CLI on one of the upgraded cluster nodes (just one node out of 3 is upgraded):
[default#testKS] list testCF;
Using default limit of 100
Using default column limit of 100
null
TimedOutException()
at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12932)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:734)
at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:718)
at org.apache.cassandra.cli.CliClient.executeList(CliClient.java:1485)
at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:272)
at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:210)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:337)
I get the same exception running the same list operation on a non-upgraded node (1.0.9):
[default#testKS] list testCF;
Using default limit of 100
null
TimedOutException()
at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12830)
at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:762)
at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:734)
at org.apache.cassandra.cli.CliClient.executeList(CliClient.java:1390)
at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:269)
at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:220)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:348)
When trying the same list operation on another 1.0.9 node I get no errors, I assume this node holds the only replica:
[default#testKS] list testCF;
Using default limit of 100
-------------------
RowKey: 0a
=> (column=0a, value=0a, timestamp=1362642828623000)
1 Row Returned.
Elapsed time: 11 msec(s).
The testCF I'm using has just one key and value, and the timeout exception is consistent.
I'm using a replication factor of 1, but all three nodes seem to be up and synced.
It seems like this is the simplest upgrade one can make, from 1.0.9 to 1.2.1/2, and it's still failing on range(?) scans.
The same error occurs when using select * queries from the CQL client.
I didn't do any special test, so I'm guessing this problem would occur in any upgrade from 1.0.9 and should be quite easy to reproduce.
nodetool ring shows all nodes are up (running on all nodes):
-bash-4.1$ nodetool -h localhost ring
Datacenter: US
==========
Replicas: 1
Address Rack Status State Load Owns Token
113427455640312821154458202477256070485
33.33.33.2 RAC1 Up Normal 39.31 KB 33.33% 0
33.33.33.3 RAC1 Up Normal 63.39 KB 33.33% 56713727820156410577229101238628035242
33.33.33.4 RAC1 Up Normal 63.39 KB 33.33% 113427455640312821154458202477256070485
Once I finish the upgrade on all nodes, the cluster goes back to normal.
Why is that? It shouldn't be that way, right?
As I understand it, Cassandra should be able to support a rolling upgrade such as this, right?
If anyone has encountered this problem, or thinks they know what the cause might be, please advise.
Or.
Rolling upgrades are possible, and I do them all the time. However, you should probably not jump two major releases, from 1.0 to 1.2. I know 1.2 is classified as a point release, but a lot changed.
You should do a rolling upgrade to 1.1.X, then run upgradesstables on all nodes, then repair on all nodes.
Then rolling upgrade to 1.2.X, then run upgradesstables on all nodes, then repair on all nodes.
So, rolling upgrades are supported, but you should do them in a rational fashion to reduce risk. Your 3-node cluster shouldn't take long at all.
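A minimal sketch of the per-node sequence for each rolling pass (assuming a package-based install; adjust the service commands to your environment):
nodetool drain                # flush memtables and stop accepting writes on this node
sudo service cassandra stop
# install the next Cassandra version (1.1.x on the first pass, 1.2.x on the second)
sudo service cassandra start
nodetool upgradesstables      # rewrite SSTables in the new on-disk format
nodetool repair               # once every node in the cluster runs the same version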