After a restart, my Cassandra node does not start anymore. It ends with the following error message.
ERROR 18:39:37 Unknown exception caught while attempting to update MaterializedView! findkita.kitas
java.lang.AssertionError: We shouldn't have got there is the base row had no associated entry
Cassandra has heavy CPU usage and uses 2.1 GB of memory, although 1 GB more is available. I ran nodetool cleanup and nodetool repair, but that did not help.
I have 5 materialized views on this table, but the number of rows in the table is under 2000, which is not much.
Cassandra runs in a Docker container. The container is accessible, but I cannot call cqlsh, and my website cannot connect either.
How can I fix the error? Is it possible to fix it?
I did not really fix it, but I got it running. My first container has now crashed completely and cannot be started anymore. But I had the same problem with other containers that are still enterable. I ran apt-get update and apt-get upgrade and got Cassandra working again.
It does not matter whether there actually are any upgrades; just running the commands makes Cassandra callable again. I have to do it at each restart, but that is better than a completely crashed database.
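For reference, this is roughly what I run; the container name cassandra-node is just a placeholder for my setup, and the image is Debian-based, so apt-get is available:
docker exec -it cassandra-node bash   # cassandra-node is a placeholder container name
apt-get update
apt-get upgrade -y                    # the upgrade itself may not matter; running the commands is what unblocked Cassandra for me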
Related
I am running a 29-node cluster spread over 4 DCs in EC2, using C* 3.11.1 on Ubuntu, with RF 3. Occasionally I have the need to restart nodes in the cluster, but every time I do, I see errors and application (nodejs) timeouts.
I restart a node like this:
nodetool disablebinary && nodetool disablethrift && nodetool disablegossip && nodetool drain
sudo service cassandra restart
When I do that, I very often get timeouts and errors like this in my nodejs app:
Error: Cannot achieve consistency level LOCAL_ONE
My queries are all pretty much the same, things like: select * from history where ts > {current_time} (along with the partition key in the where clause)
The errors and timeouts seem to go away on their own after a while, but it is frustrating because I can't track down what I am doing wrong!
I've tried waiting between the steps of shutting down Cassandra, and I've tried stopping, waiting, then starting the node. One thing I've noticed is that even after draining the node with nodetool drain, there are open connections to other nodes in the cluster (i.e., visible in the output of netstat) until I stop Cassandra. I don't see any errors or warnings in the logs.
One other thing I've noticed is that after restarting a node and seeing application latency, I also see that the node I just restarted marks many other nodes in the same DC as down (i.e., status 'DN'). However, checking nodetool status on those other nodes shows all nodes as up/normal. To me this could partly explain the problem: the node comes back online, thinks it is healthy while many others are not, so it gets traffic from the client application. But then it gets requests for ranges that belong to a node it thinks is down, so it responds with an error. The latency issue seems to start roughly when the node goes down, but persists long (i.e., 15-20 minutes) after it is back online and accepting connections. It seems to go away once the bounced node shows the other nodes in the same DC as up again.
I have not been able to reproduce this locally using ccm.
What can I do to prevent this? Is there something else I should be doing to gracefully restart the cluster? It could be something to do with the nodejs driver, but I can't find anything there to try.
I seem to have been able to resolve the issue by issuing nodetool disablegossip as the last step in shutting down. So using this instead of my initial approach to restarting seems to work (note that only the order of drain and disablegossip has switched):
nodetool disablebinary
nodetool disablethrift
nodetool drain
nodetool disablegossip
sudo service cassandra restart
While this seems to work, I have no explanation as to why. On the mailing list, someone helpfully pointed out that the drain should take care of everything that disablegossip does, so my hypothesis is that doing the disablegossip first causes the drain to then have problems which only appear after startup.
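For completeness, here is a rough per-node sketch of how I script the rolling restart now; the 10-minute wait loop and the grep for 'DN' lines in the nodetool status output are my own assumptions, not anything official:
nodetool disablebinary
nodetool disablethrift
nodetool drain
nodetool disablegossip
sudo service cassandra restart
# Wait until this node is reachable again and no longer marks any peer as down ('DN')
# before moving on to the next node in the rolling restart
for i in $(seq 1 60); do
    if nodetool status > /tmp/ring 2>/dev/null && ! grep -q '^DN' /tmp/ring; then
        echo "Node is back and sees all peers as up"
        break
    fi
    sleep 10
done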
I am testing DSE Graph (using DSE 5.0.7) on a single node and managed to corrupt it completely. As a result, I wiped out all the data files with the intention of rebuilding everything from scratch. On the first restart of Cassandra I forgot to include the -G option, but Cassandra came up fine and was viewable from OpsCenter, nodetool, etc. I shut this down, cleared out the data directories, and restarted Cassandra again, this time with the -G option. It starts up and then shuts itself down with the following warning written to the log:
WARN [main] 2017-06-08 12:59:03,157 NoSpamLogger.java:97 - Failed to create lease HadoopJT.Graph. Possible causes include network/C* issues, the lease being disabled, insufficient replication (you created a new DC and didn't ALTER KEYSPACE dse_leases) and the duration (30000) being different (you have to disable/delete/recreate the lease to change the duration).
java.io.IOException: No live replicas for lease HadoopJT.Graph in table dse_leases.leases Nodes [/10.28.98.53] are all down/still starting.
at com.datastax.bdp.leasemanager.LeasePlugin.getRemoteMonitor(LeasePlugin.java:538) [dse-core-5.0.7.jar:5.0.7]
After this comes the error:
ERROR [main] 2017-06-08 12:59:03,182 Configuration.java:2646 - error parsing conf dse-core-default.xml
org.xml.sax.SAXParseException: Premature end of file.
with a 0-byte dse-core-default.xml being created. Deleting this file and retrying yields the same result, so I suspect it is a red herring.
Does anyone have any idea how to fix this, short of reinstalling everything from scratch?
It looks like this might be fixed by removing a very large java_pidnnnnn.hprof file that was sitting in the bin directory. I'm not sure why this fixed the problem; does anyone have any idea?
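For anyone else who hits this, I found the offending file with something along these lines; the /usr/share/dse path is an assumption for a package install, so adjust it to wherever DSE lives on your system:
find /usr/share/dse -maxdepth 3 -name 'java_pid*.hprof' -size +100M -ls
rm /usr/share/dse/bin/java_pidnnnnn.hprof   # use the actual PID shown by find; nnnnn is just the placeholder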
I have just one reason to restart the cluster, listed below:
All the nodes have the same hardware configuration.
1. When I update the cassandra.yaml file
Are there other reasons ?
What you are asking about is a rolling restart of a Cassandra cluster. There are many reasons to restart a Cassandra cluster; I'm just mentioning some below:
When you update any value in cassandra.yaml (as you mentioned above).
When nodetool gets stuck somehow. For example, you issued nodetool repair and then cancelled the command, but it remained stuck in the background, so you cannot issue another nodetool repair command.
When you are adding a new node to the cluster and streaming fails (stream_failed) due to the nproc limit. In that case, the running cluster nodes can go down because of this issue and get stuck in that status.
When you don't want to use sstableloader and need to restore your data from snapshots. In that case, you need to copy your snapshots into the data directory on each node and do a rolling restart (see the sketch after this list).
When you are about to upgrade your Cassandra version.
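For the snapshot-restore case mentioned above, the per-node steps look roughly like this; the keyspace, table directory, snapshot name, and data path are placeholders (the packaged default data directory is usually /var/lib/cassandra/data):
DATA=/var/lib/cassandra/data            # adjust if your data_file_directories setting differs
KS=my_keyspace                          # placeholder keyspace
TBL=my_table-3f2a81c0example            # placeholder table directory (the name includes a UUID suffix)
# Copy the snapshot SSTables back into the live table directory
cp "$DATA/$KS/$TBL/snapshots/my_snapshot/"* "$DATA/$KS/$TBL/"
# Then restart this node before moving on to the next one in the rolling restart
nodetool drain
sudo service cassandra restart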
I'm using MemSQL 5.0.8 community version. I randomly get the ER_STMT_CACHE_FULL error, and all I can do is reboot the server. How can I increase it?
MemSQL doesn't have a statement cache and never generates this error. You are likely running your application against MySQL or MariaDB instead.
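If the error is in fact coming from MySQL/MariaDB, then as far as I know ER_STMT_CACHE_FULL points at the max_binlog_stmt_cache_size variable, so raising it may help; the 64 MB value below is just an example:
mysql -u root -p -e "SET GLOBAL max_binlog_stmt_cache_size = 67108864;"   # 64 MB; pick a value that fits your transactions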
We're hosting a Cassandra 2.0.2 cluster on AWS. We've recently started upgrading from normal to SSD drives by bootstrapping new nodes and decommissioning old ones. It went fairly well, aside from two nodes hanging forever on decommission. Now, after the 6 new nodes are operational, we noticed that some of our old tools, using phpcassa, stopped working. Nothing has changed with the security groups, all TCP/UDP ports are open, telnet can connect via 9160, and cqlsh can 'use' a cluster and select data. However, 'describe cluster' fails, and in cassandra-cli, 'show keyspaces' also fails - and by fail, I mean it never exits back to the prompt, nor returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot perform them. The production system, also using phpcassa, does normal data requests and works fine.
All Cassandra nodes have the same config, the same versions, and were installed from the same package. All nodes were recently restarted due to a seed node change.
Versions:
Connected to ### at ####.compute-1.amazonaws.com:9160.
[cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I've run out of ideas. Any hints would be greatly appreciated.
Update:
After a bit of random investigating, here's a more detailed description.
If I cassandra-cli to any machine, and do "show keyspaces", it works.
If I cassandra-cli to a remote machine, and do "show keyspaces", it hangs indefinitely.
If I cqlsh to a remote cassandra, and do a describe keyspaces, it hangs. If I press Ctrl+C and repeat the same query, it instantly responds.
If I cqlsh to a local cassandra, and do a describe keyspaces, it works.
If I cqlsh to a local cassandra, and do a select * from Keyspace limit x, it will return data up to a certain limit. I was able to return data with limit 760; limit 761 would fail.
If I do a consistency all, and select the same, it hangs.
If I do a trace, different machines return the data, though sometimes source_elapsed is "null".
Not to forget, applications querying the cluster sometimes do get results, after several attempts.
Update 2
Further playing around introduced failed bootstrapping of two nodes: one hung on bootstrap for 4 days and eventually failed, possibly due to a rolling restart, and the other plainly failed after 2 days. Repairs wouldn't function and introduced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". Also, after executing repair, we started getting "Read an invalid frame size of 0. Are you using tframedtransport on the client side?", so...
Solution
Switch rpc_server_type from hsha to sync. All problems are gone. We had worked with hsha for months without issues before this.
In case someone else stumbles upon this:
http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/
In cassandra.yaml:
Switch rpc_server_type from hsha to sync.
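If you edit the file by hand, something like this does it; the /etc/cassandra path is the Debian-package default, so adjust it if your cassandra.yaml lives elsewhere:
sudo sed -i 's/^rpc_server_type: hsha/rpc_server_type: sync/' /etc/cassandra/cassandra.yaml
sudo service cassandra restart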