How long does a MemSQL upgrade take? - singlestore

I have started an offline upgrade to move my MemSQL cluster from 5.8 to 6.5. The data size is around 300 GB. It has been 5 hours already, but I have lost all access to the cluster and there is no way to check the status.
memsql-ops memsql-list shows all leaves and aggregators as online.
But memsql> SHOW LEAVES; returns an empty set, and my master aggregator was automatically converted to a child aggregator, so now I don't have a master aggregator at all.
I can't execute any command (like AGGREGATOR SET AS MASTER) on a child aggregator; it says 'memsql is not running as an aggregator' or 'memsql node is not running', and SQL queries return "The database 'xxx' is not available for queries, as it is waiting for the Master Aggregator to bring it online. Run SHOW DATABASES EXTENDED ..."
Also, performing any management command like memsql-ops restart returns "Job cannot run because there is a MemSql upgrade intention with ID xxx is in progress".
Any information about this will be helpful, as I am not able to find any related information online.
Thanks in advance...

We debugged the issue in the MemSQL public chat and found that the master aggregator was running an unsupported beta version of MemSQL (6.0.0), which prevented the upgrade and then corrupted the database post-upgrade.
For future readers: please verify that you are not running beta versions of MemSQL on production clusters. If you are, not only will the upgrade likely break, but it may not be possible to recover your data on a non-beta cluster.
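A quick way to audit this before kicking off an upgrade is to check the version that every node reports, for example (a rough sketch; exact output columns vary by Ops version):

    memsql-ops memsql-list        # lists the nodes Ops manages, including their MemSQL version
    # or ask each aggregator and leaf directly from a SQL client:
    # SELECT @@memsql_version;

Anything reporting a beta build (such as 6.0.0) should be dealt with before the upgrade, not after.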

Related

Does "spring data cassandra" have client side loadbalancing?

I'm running a project using spring-boot and spring-data-cassandra.
When I set the project up, I configured the Cassandra properties with an IP and port
(as described at https://www.baeldung.com/spring-data-cassandra-tutorial).
With that setup, if I have 3 Cassandra nodes and 1 of them dies, I would expect the project to fail to connect to Cassandra with about 33% probability.
But my project was fine even though 1 Cassandra node was dead (there were just some errors as the node went down).
Does spring-data-cassandra happen to have a feature like client-side load balancing?
If it has that feature, where can I see that code?
I tried to find that code but failed.
Please give me a little clue.
Spring Data Cassandra relies on the functionality of the DataStax Java driver, which is responsible for making everything work. This includes:
establishing the initial connection to the cluster. This is where the contact points play their role: after the driver connects to any of the contact points, it reads information about the whole cluster and establishes connections to all nodes (by default)
establishing the control connection that is used to receive notifications about changes in the cluster - nodes going up and down, schema changes, etc. If a node goes down or comes up, this information is used to update the list of active nodes
providing load balancing of requests based on replication and node availability - if a node is down, it's excluded from the list of candidates, so queries are not sent to a node that is known to be down
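In practice that means you should list several contact points rather than a single IP, and the driver takes care of discovering the rest of the ring and avoiding dead nodes. A minimal sketch of such a configuration using Spring Boot's spring-data-cassandra properties (host names and keyspace are placeholders; exact property names vary slightly between Spring Boot versions):

    spring.data.cassandra.contact-points=cassandra-1,cassandra-2,cassandra-3
    spring.data.cassandra.port=9042
    spring.data.cassandra.keyspace-name=my_keyspace
    spring.data.cassandra.local-datacenter=datacenter1

Even with a single contact point, once the driver has connected it learns the whole topology and opens connections to the other nodes, which is why your application kept working when one of the three nodes died.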

What is the RPC_READY property in Cassandra nodetool gossipinfo output?

I have a Cassandra cluster with multiple nodes. When I run 'nodetool gossipinfo', I see that one node has an RPC_READY value different from the others; all the other nodes share the same value. Can anyone explain what this property is and whether it is a problem if the value is different on one node? I am using Cassandra version 2.2.8.
I would appreciate a response.
Before 2.2, when a node came up it would broadcast to all the other nodes that it was now in an UP state. This sometimes happened before CQL was ready. The drivers listen for events like changes in state, so when the node went up the drivers would try to connect to it.
If they tried before CQL was ready, the connection would fail and trigger a backoff, which greatly increased the time it took to connect to newly up nodes. This caused the driver's state to flip from UP to DOWN with a lot of log spam. RPC_READY is a state that tracks whether the node is actually ready for drivers to connect to; there is a Jira ticket where it was added. In current versions at least (I haven't looked at 2.2), RPC_READY can also change to false when the node is being shut down (drained) or decommissioned.
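To compare the flag across nodes, you can filter the gossip output, for example (the exact output format differs a bit between Cassandra versions):

    nodetool gossipinfo | grep -E '^/|RPC_READY'

A node that has been drained, is shutting down, or is being decommissioned will show a false value for RPC_READY, while a node that is ready to accept driver connections shows true.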

Upgrade cassandra 2.1.19 cluster to 3.11.1

I want to upgrade cassandra 2.1.19 cluster to 3.11.1 without downtime.
Will 3.11.1 nodes work together with 2.1.19 nodes at the same time?
The key point will be how you connect to your cluster. You will need to try out on a test system whether everything still works from your application side when you do the switch.
I recommend a two-step process in this case: first migrate from 2.1.19 to 3.0.x, one node at a time.
For every node, do the following (I said you need to test this before going to production, right?); a condensed shell sketch follows the list:
nodetool drain - wait for it to finish
stop cassandra
back up your configs; the old ones won't work out of the box
remove the cassandra package / tarball
read about the Java and other Cassandra 3.x requirements and ensure you meet them
add the repo and install the 3.0.x package or tarball
some packages start the node immediately - you may have to stop it again
create the new config files (diff or something similar will be your friend; read the docs about the new options) - you only need to do this once and should be able to reuse them on all the other nodes
start cassandra (did I say to test this on a test system?) and wait until the node has joined the ring again (nodetool status)
upgrade your sstables with nodetool upgradesstables - almost always needed; don't skip this even if "something" works right now
this upgrade tends to be really slow - it's just a single thread rewriting all your data, so I/O will be a factor here
all up and running -> go ahead to the next node and repeat
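A condensed sketch of that per-node 2.1 -> 3.0.x pass (service and package handling are assumptions that depend on your distribution; adapt it and test it first):

    nodetool drain                                 # flush memtables and stop accepting writes
    sudo service cassandra stop
    sudo cp -a /etc/cassandra /etc/cassandra.bak   # keep the old configs for reference
    # remove the 2.1 package, add the 3.0.x repo and install the new package (distribution-specific)
    # merge your old settings into the new cassandra.yaml / cassandra-env.sh
    sudo service cassandra start
    nodetool status                                # wait until the node shows UN (up/normal) again
    nodetool upgradesstables                       # rewrite SSTables to the new format; single-threaded and slow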
After that, upgrade from 3.0.x to 3.11.x in the same fashion: add the new repo, configure for 3.11.x as you did for 3.0.x above, and so on. This time you can skip upgrading sstables, as the format stays the same (but it won't do any harm if you do).
Did I mention to do this on a test system first? One thing that will happen and may break things: older native protocol versions will be gone, as well as RPC/Thrift.
Hope I didn't miss something ;)

memsql aggregator failure - how to recover the cluster

I have a MemSQL cluster with 4 child aggregators, 30 leaves, and one master that failed. At this point I can't recover the master no matter what I do; that instance is gone. I have promoted one of the child aggregators to master.
Once I connect to MemSQL and run SHOW DATABASES; / SHOW LEAVES; / SHOW AGGREGATORS; ... everything is in place. However, how do I fully convert this child into a master? On the web UI the master appears to be running a freshly started cluster with zero leaves. Also, I can't see any master folder created on the child aggregator that was promoted.
So my question is: where do I go from here? For example, if I want to restart the entire cluster, how am I going to do it, given that on the promoted child node memsql-ops memsql-list returns
No MemSQL nodes were found?
How will I perform the typical operations - update, restart?
It sounds like you have successfully promoted a child aggregator to master in the MemSQL cluster, but MemSQL Ops has lost all the cluster information because the Ops primary agent - which by default was on the same host as the Master Aggregator - is gone.
I'm not sure about your situation - did you promote a new Ops primary agent? - but in general, if you have a functioning MemSQL cluster and MemSQL Ops on all the nodes of the cluster, but Ops is not monitoring MemSQL (i.e. memsql-ops memsql-list is empty), you would run memsql-ops memsql-monitor for each MemSQL node to add them into Ops monitoring.
EDIT: the answer was that you haven't promoted a new Ops primary agent yet. In that case, here is what you need to do:
Run memsql-ops unfollow on every node except the old primary
Choose a node to be the new primary - e.g. the new Master Aggregator.
Run memsql-ops follow -h NEW_PRIMARY_HOSTNAME on every node except the new primary
Run memsql-ops monitor -h NEW_MASTER_AGGREGATOR
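Putting that together, a rough sketch of the sequence (host names are placeholders; the thread above uses both memsql-ops monitor and memsql-ops memsql-monitor, so double-check the exact spelling and flags against memsql-ops --help for your Ops version):

    # on every node except the old, now dead, primary agent:
    memsql-ops unfollow
    # on every node except the one chosen as the new primary (e.g. the new Master Aggregator):
    memsql-ops follow -h NEW_PRIMARY_HOSTNAME
    # re-add the MemSQL nodes to Ops monitoring, starting with the new Master Aggregator:
    memsql-ops memsql-monitor -h NEW_MASTER_AGGREGATOR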

Cassandra hangs on arbitrary commands

We're hosting a Cassandra 2.0.2 cluster on AWS. We've recently started upgrading from normal to SSD drives by bootstrapping new nodes and decommissioning old ones. It went fairly well, aside from two nodes hanging forever on decommission. Now, after the 6 new nodes are operational, we noticed that some of our old tools, which use phpcassa, stopped working. Nothing has changed with the security groups, all TCP/UDP ports are open, telnet can connect via 9160, and cqlsh can 'use' a keyspace and select data; however, 'describe cluster' fails, and in cassandra-cli 'show keyspaces' also fails - and by fail, I mean it never exits back to the prompt nor returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot perform them. The production system, also using phpcassa, does normal data requests and works fine.
All Cassandra nodes have the same config, the same version, and were installed from the same package. All nodes were recently restarted due to a seed node change.
Versions:
Connected to ### at ####.compute-1.amazonaws.com:9160.
[cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I've run out of ideas. Any hints would be greatly appreciated.
Update:
After a bit of random investigation, here's a more detailed description.
If I cassandra-cli to any machine, and do "show keyspaces", it works.
If I cassandra-cli to a remote machine, and do "show keyspaces", it hangs indefinitely.
If I cqlsh to a remote cassandra, and do a describe keyspaces, it hangs. ctrl+c, repeat the same query, it instantly responds.
If I cqlsh to a local cassandra, and do a describe keyspaces, it works.
If I cqlsh to a local cassandra, and do a select * from Keyspace limit x, it will return data up to a certain limit. I was able to return data with limit 760; limit 761 would fail.
If I do a consistency all, and select the same, it hangs.
If I do a trace, different machines return the data, though sometimes source_elapsed is "null".
Also, applications querying the cluster sometimes do get results, after several attempts.
Update 2
Further experimentation led to failed bootstrapping of two nodes: one hung on bootstrap for 4 days and eventually failed, possibly due to a rolling restart, and the other plainly failed after 2 days. Repairs wouldn't work and produced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". Also, after executing a repair, we started getting "Read an invalid frame size of 0. Are you using TFramedTransport on the client side?", so..
Solution
Switching rpc_server_type from hsha to sync made all the problems go away. We had worked with hsha for months without issues.
If someone also stumbles here:
http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/
In cassandra.yaml:
Switch rpc_server_type from hsha to sync.
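For concreteness, the single change in cassandra.yaml looks like this (restart each node after editing):

    rpc_server_type: sync    # previously: hsha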
