I am testing DSE Graph (using DSE 5.0.7) on a single node and managed to corrupt it completely. As a result, I wiped all the data files with the intention of rebuilding everything from scratch. On the first restart of Cassandra I forgot to include the -G option, but Cassandra came up fine and was viewable from OpsCenter, nodetool, etc. I shut this down, cleared out the data directories, and restarted Cassandra again, this time with the -G option. It starts up and then shuts itself down with the following warning written to the log:
WARN [main] 2017-06-08 12:59:03,157 NoSpamLogger.java:97 - Failed to create lease HadoopJT.Graph. Possible causes include network/C* issues, the lease being disabled, insufficient replication (you created a new DC and didn't ALTER KEYSPACE dse_leases) and the duration (30000) being different (you have to disable/delete/recreate the lease to change the duration).
java.io.IOException: No live replicas for lease HadoopJT.Graph in table dse_leases.leases Nodes [/10.28.98.53] are all down/still starting.
at com.datastax.bdp.leasemanager.LeasePlugin.getRemoteMonitor(LeasePlugin.java:538) [dse-core-5.0.7.jar:5.0.7]
After this comes the error
ERROR [main] 2017-06-08 12:59:03,182 Configuration.java:2646 - error parsing conf dse-core-default.xml
org.xml.sax.SAXParseException: Premature end of file.
with a 0-byte dse-core-default.xml being created. Deleting it and retrying yields the same result, so I suspect this is a red herring.
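(For anyone hitting the same warning: the replication cause it names can be checked, and if necessary fixed, from cqlsh. A sketch only; 'GraphDC' is a placeholder for the data-center name that nodetool status reports:
cqlsh -e "DESCRIBE KEYSPACE dse_leases"    # check the current replication settings
cqlsh -e "ALTER KEYSPACE dse_leases WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'GraphDC': 1}"
In my single-node case this was not the cause, as it turned out below.)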
Anyone have any idea how to fix this short of reinstalling everything from scratch?
Looks like this might be fixed by removing a very large java_pidnnnnn.hprof file that was sitting in the bin directory. I'm not sure why this fixed the problem; does anyone have any idea?
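For anyone else hitting this, the cleanup amounted to roughly the following; the bin path is from my install, adjust to yours:
ls -lh /usr/share/dse/bin/java_pid*.hprof    # heap dumps left behind by an earlier OutOfMemoryError
rm /usr/share/dse/bin/java_pid*.hprof
df -h /usr/share/dse/bin    # confirm the disk space came back
A plausible explanation is that the huge heap dump had filled the disk, which would also explain the 0-byte dse-core-default.xml.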
Related
I have the following Cassandra question:
A few days ago I developed an application using C# and a single-node Cassandra DB. While the application was in production, a power failure occurred and the Cassandra commitlog got corrupted. Because of this the Cassandra node would not start, so I moved all the commitlog files to another directory and started the node.
Recently I noticed that the data from the day of the power failure is not available in the database. I still have all the commitlog files, including the corrupted one.
Can you please suggest whether there is a way to recover the data using the commitlog files?
Also, how can commitlog corruption be avoided, so that data loss in production can be prevented?
Thank you.
There is no way to restore the node to its previous state if your commitlogs are corrupted and you have no SSTables.
If your commitlogs are healthy (not corrupted), then you just need to restart your node. They will be replayed, which will rebuild the memtables and flush generation-1 SSTables to disk.
Ideally, what you can do is force SSTables to be created.
You can do that from the apache-cassandra/bin directory with
nodetool flush
So if you are wary of losing commitlogs, you can rebuild your node to a previous state using the SSTables created above, with
nodetool.bat refresh [keyspace] [columnfamily]
Alternatively you can also try creating snapshots.
nodetool snapshot
This command will take a snapshot of all keyspaces on the node. You also have the option of creating incremental backups, but these only keep a record of the latest operations.
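Putting those commands together, a minimal sketch of the flush-and-restore flow (keyspace and column family names are placeholders):
nodetool flush my_keyspace    # force memtables to SSTables on disk
nodetool snapshot    # point-in-time copy of all keyspaces on the node
# after copying the flushed/snapshotted SSTables into the data directory of the rebuilt node:
nodetool refresh my_keyspace my_cf    # load the newly placed SSTables without a restart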
For more info try reading
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsNodetool_r.html
I also suggest adding more nodes and increasing the replication factor to avoid such scenarios in the future.
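For example (keyspace name and factor are placeholders; the repair streams the existing data to the new replicas):
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
nodetool repair my_keyspace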
Hope it helps!
I have a single-node Cassandra setup for my application. To reclaim disk space occupied by deleted (tombstoned) records, I triggered a nodetool compact for my keyspace. Unfortunately, this compaction process got interrupted. Now, when I try to restart the service, it does not recognise the keyspace (from the data directory configured in cassandra.yaml) for which compaction was in progress when it was interrupted. Other keyspaces like system and system_traces are successfully initialised from the same data directory.
Has anybody encountered a similar issue before? Also, pointers on restoring a keyspace from data files alone would be of great help (since no snapshots were maintained).
PS: Upon analysing further it was found that an rm command on the cassandra data directory was issued but immediately cancelled. Most of the data seems to be in place, but there is a chance that the Data.db file of the system keyspace was lost. Is there a way to recover from this state?
It seems you have corrupted your setup by deleting system keyspace files, so Cassandra may not be finding the keyspace at boot time.
Try this:
Download the same version of Cassandra again.
Create your keyspace & cf schemas (a cqlsh sketch follows below).
Move whatever old data is left to the new data directory (Cassandra will only load the non-corrupted data):
sudo mv /data/cassandra_old/data/[keyspace]/[cf]-[md5-old]/* /data/cassandra_new/data/[keyspace]/[cf]-[md5-new]/
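For the schema step, replaying your original DDL through cqlsh is enough. A sketch with placeholder names and options:
cqlsh -e "CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
cqlsh -e "CREATE TABLE my_keyspace.my_cf (id uuid PRIMARY KEY, value text)"
Creating the table is what generates the new [cf]-[md5-new] directory that the mv above targets.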
It should solve it if I understand the problem correctly.
After a restart, my Cassandra node does not start anymore. It ends with the following error message:
ERROR 18:39:37 Unknown exception caught while attempting to update MaterializedView! findkita.kitas
java.lang.AssertionError: We shouldn't have got there is the base row had no associated entry
Cassandra has heavy CPU usage and uses 2.1 GB of memory, although 1 GB more is available. I ran nodetool cleanup and repair, but it did not help.
I have 5 materialized views on this table, but the number of rows in the table is under 2000, which is not much.
Cassandra runs in a Docker container. The container is accessible, but I cannot call cqlsh, and my website cannot connect either.
How can I fix the error? Is it possible to fix it?
I did not really fix it, but I got it running. My first container has completely crashed by now and cannot be started anymore. But I had the same problem with other containers that can still be entered. I ran apt-get update and apt-get upgrade and got Cassandra working again.
It does not matter whether any packages actually get upgraded; just running the commands makes Cassandra callable again. I have to do this at each restart, but that is better than a completely crashed database.
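From the host, the workaround looks roughly like this (the container name is a placeholder):
docker exec -it my_cassandra bash -c "apt-get update && apt-get upgrade -y"
docker exec my_cassandra nodetool status    # check the node answers again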
My Situation:
I have a server with multiple hard disks.
If I install Cassandra (2.1.9) on the server and use all the hard disks, what happens if one hard disk goes down?
Will it blacklist only that partition (hard disk) and move the (Cassandra) partitions to other nodes or to the remaining partitions on the same node?
Or will it treat the entire node as down?
The behavior is configured in cassandra.yaml using the disk_failure_policy setting. See documentation here.
disk_failure_policy: (Default: stop) Sets how Cassandra responds to disk failure.
Recommended settings are stop or best_effort.
die - Shut down gossip and Thrift and kill the JVM for any file system errors or single SSTable errors, so the node can be replaced.
stop_paranoid - Shut down gossip and Thrift even for single SSTable errors.
stop - Shut down gossip and Thrift, leaving the node effectively dead, but available for inspection using JMX.
best_effort - Stop using the failed disk and respond to requests based on the remaining available SSTables. This means you will see obsolete data at consistency level of ONE.
ignore - Ignores fatal errors and lets the requests fail; all file system errors are logged but otherwise ignored. Cassandra acts as in versions prior to 1.2.
You can find documentation on how to recover from a disk failure here. Cassandra will not automatically move data from a failed disk to the good disks. It requires manual intervention to correct the problem.
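For example, switching a node to best_effort is a one-line change in cassandra.yaml followed by a restart. A sketch, assuming the package-default config path:
grep -n '^disk_failure_policy' /etc/cassandra/cassandra.yaml    # locate the current setting
sudo sed -i 's/^disk_failure_policy:.*/disk_failure_policy: best_effort/' /etc/cassandra/cassandra.yaml
sudo service cassandra restart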
We're hosting a Cassandra 2.0.2 cluster on AWS. We recently started upgrading from regular to SSD drives by bootstrapping new nodes and decommissioning old ones. It went fairly well, aside from two nodes hanging forever on decommission. Now, after the 6 new nodes are operational, we noticed that some of our old tools, which use phpcassa, stopped working. Nothing has changed with security groups, all TCP/UDP ports are open, telnet can connect via 9160, and cqlsh can 'use' a keyspace and select data; however, 'describe cluster' fails, and in cassandra-cli 'show keyspaces' also fails - and by fail, I mean it never exits back to the prompt, nor returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot perform them. The production system, also using phpcassa, makes normal data requests and works fine.
All Cassandra nodes have the same config, the same version, and were installed from the same package. All nodes were recently restarted due to a seed node change.
Versions:
Connected to ### at ####.compute-1.amazonaws.com:9160.
[cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
I've run out of ideas. Any hints would be greatly appreciated.
Update:
After a bit of random investigating, here's a more detailed description.
If I run cassandra-cli locally on any machine, and do "show keyspaces", it works.
If I cassandra-cli to a remote machine, and do "show keyspaces", it hangs indefinitely.
If I cqlsh to a remote cassandra, and do a describe keyspaces, it hangs. ctrl+c, repeat the same query, it instantly responds.
If I cqlsh to a local cassandra, and do a describe keyspaces, it works.
If I cqlsh to a local cassandra, and do a select * from Keyspace limit x, it will return data up to a certain limit. I was able to return data with limit 760; limit 761 would fail.
If I do a consistency all, and select the same, it hangs.
If I do a trace, different machines return the data, though sometimes source_elapsed is "null"
Not to forget, applications querying the cluster sometimes do get results, after several attempts.
Update 2
Further experimenting led to failed bootstrapping of two nodes: one hung on bootstrap for 4 days and eventually failed, possibly due to a rolling restart, and the other plainly failed after 2 days. Repairs wouldn't function and introduced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". Also, after executing repair, we started getting "Read an invalid frame size of 0. Are you using TFramedTransport on the client side?", so..
Solution
Switch rpc_server_type from hsha to sync. All problems gone. We worked with hsha for months without issues.
If someone also stumbles here:
http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/
In cassandra.yaml:
Switch rpc_server_type from hsha to sync.
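A sketch of the change, assuming the package-default config path; repeat on every node and restart it for the setting to take effect:
sudo sed -i 's/^rpc_server_type:.*/rpc_server_type: sync/' /etc/cassandra/cassandra.yaml
sudo service cassandra restart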