When restarting a Cassandra node, a lot of time is spent replaying the commitlog to achieve consistency. In our application, it is more important to bring the node back up and running quickly than to achieve consistency. Therefore we have set "durable_writes = false" on all our manually created keyspaces to disable the commitlog. (We have not touched the system keyspaces.) Nevertheless, when we restart a node it still spends about one hour replaying the commitlog.
What is left in my commitlog?
Is there any way I can inspect the contents of the commitlog?
How can the commitlog be turned off (if not durable_writes = false)?
durable_writes is set per keyspace, so if any keyspace still has it enabled there will still be mutations in the commitlog to replay on startup. You may want to walk through the output of DESCRIBE SCHEMA.
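For example, on Cassandra 3.x (where the schema lives in system_schema) you can check it with something like:
cqlsh -e "SELECT keyspace_name, durable_writes FROM system_schema.keyspaces;"
cqlsh -e "DESCRIBE SCHEMA;" | grep -i durable_writes
Any keyspace that still shows durable_writes as True is still writing to the commitlog.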
There are some tables (i.e. the system tables) that you want to keep durable, but they shouldn't hold enough data to have a noticeable impact on startup. During startup Cassandra logs which keyspaces/tables it is reading, so you can check which ones it is replaying.
One hour is a very long time and has a certain smell to it; there may be something else going on here, and it probably warrants additional investigation. One idea is to check the logs and make sure it is the commitlog replay that's taking the time (not rebuilding index summaries or something similar). Also check that there are no old commit logs sticking around, for example ones C* doesn't have permission to delete.
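For example, something along these lines (the paths assume a default package install):
grep -Ei 'replaying|commitlog' /var/log/cassandra/system.log
ls -lh /var/lib/cassandra/commitlog/
If the replay messages account for the hour, it really is the commitlog; if old segments keep piling up in that directory across restarts, look at permissions or disk issues.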
Do 'nodetool drain' before shutting down the node. This flushes all memtables to SSTables, so there is nothing left in the commitlog to replay on startup.
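Something like this, assuming a service-managed install:
nodetool drain
sudo service cassandra stop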
As far as I understood, the problem of deleted data reappearing in Cassandra is as follows:
A delete is issued with consistency < ALL (e.g. QUORUM)
The delete succeeds, but some nodes in the replication set were not reachable during the delete
A tombstone is written to all the reached nodes, nothing in the others
10 days (the default gc_grace_seconds) pass, and the tombstones become eligible for expiry
Compactions happen, tombstones are actually removed
A read is issued: the nodes which received the delete reply with "no data"; the nodes which were unavailable during the delete reply with the old data; a zombie is produced
Now my question is: if the original delete was issued with consistency = ALL, all the nodes would either have the tombstone (before expiry and compaction) or no data at all (after expiry and compaction). No zombies should then be produced, even if we did not issue a repair before tombstone expiry.
Is this correct?
Yes, you still need to run repairs, even with CL.ALL on the delete, if you want to guarantee no resurrected data. You just decrease the likelihood of it occurring without you noticing.
If a node is unavailable for the delete, the delete will fail for the client (because of CL.ALL), but the other nodes still received the delete. Even if your app retries the delete, there is a chance of it failing (e.g. your app's server is hit by a meteor). So then you have a delete that has been seen by 2 of your 3 replicas. If you lowered your gc_grace and don't run repairs, the other anti-entropy measures (hints, read repairs) may not ensure the tombstone was seen by the 3rd node before it is compacted away; they are best effort, not a guarantee. The next read touches the 3rd node, which still has the original data, and no tombstone exists to say it was deleted, so you resurrect the data as it gets read-repaired to the other replicas.
What you can do is log a statement somewhere whenever there is a CL.ALL timeout or failure. This is not a guarantee, since your app can die before writing the log entry, and a failure does not actually mean that the write did not reach all replicas, just that it may have failed to. That said, I would strongly recommend just using QUORUM (or LOCAL_QUORUM). That way you can tolerate some host failures without losing availability, since you need the repairs for the guarantee anyway.
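For example, something like a regular primary-range repair on each node, run well within gc_grace_seconds (the keyspace name is a placeholder):
nodetool repair -pr my_keyspace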
When issuing queries with consistency=ALL, every replica that owns the token range of that particular record has to acknowledge. So if one of the nodes was down during this process, the DELETE will fail, as it can't achieve the required consistency of ALL.
So consistency=ALL might end up being a scenario where every replica has to stay up, otherwise queries will fail. That's why people recommend using a less strict consistency level like QUORUM. In other words, you are sacrificing high availability in exchange for skipping repairs if you want to perform queries at CONSISTENCY=ALL.
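For example, in cqlsh the consistency level is set per session before running the query (the keyspace, table, and key below are placeholders):
CONSISTENCY QUORUM;
DELETE FROM my_keyspace.my_table WHERE id = 42;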
I have a two-node Cassandra cluster, with RF of 2. So both nodes contain 100% of data.
Now, I am running short on disk space. I can remove some old data, since it was already aggregated and processed, and I don't need it anymore.
I tried running a delete query from cqlsh, but I get a timeout. I tried increasing the timeouts, but it seems the query from cqlsh still takes longer than that.
How can I disable this timeout for a single query or connection? Is there any other way, besides increasing the timeout, to remove some data from a node?
My Cassandra version is 3.11.0.
P.S. I increased write_request_timeout_in_ms in cassandra.yaml. Is this the correct setting for delete queries?
Deletes really shouldn't time out unless there is a problem somewhere else. A delete just inserts a tombstone, with no reads or anything, and should be fast/cheap regardless of what data already exists. Reads, on the other hand, can be impacted a lot. I would guess GC-related problems caused by reads. You could check the GC logs and maybe increase the heap and reduce CMSInitiatingOccupancyFraction (if using CMS and not G1).
So check the GC and normal logs for issues (look for WARN and ERROR in the system log) and look at pause times in the GC logs; there should be none above 1 second.
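For example (paths assume a default package install, adjust to your setup):
grep -E 'WARN|ERROR' /var/log/cassandra/system.log | tail -n 50
grep -i GCInspector /var/log/cassandra/system.log | tail -n 20
The GCInspector lines include pause durations, so anything around a second or more is a red flag.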
After issuing the delete you could try forcing a compaction (nodetool compact keyspace table) to see if it helps with disk space. The delete by itself will not reduce disk usage until the data has been compacted away together with the tombstone.
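For example (keyspace and table names are placeholders):
nodetool flush my_keyspace my_table
nodetool compact my_keyspace my_table
The flush makes sure the fresh tombstones are on disk before the major compaction runs.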
write_request_timeout_in_ms is the right setting, but if you're hitting it something is wrong and you're just masking it. A delete should really take less than 1 millisecond in normal use.
Side note: RF=2 on a 2-node cluster is not how C* is designed to run. If either node goes down, quorum operations fail, so you have no availability on a database that sacrificed consistency for high availability.
I started to use Cassandra 3.7 and I always have problems with the commitlog. When the PC shuts down unexpectedly, for example because of a power outage, the Cassandra service doesn't restart. I try to start it from the command line, but the error 'could not read commit log descriptor in file' always appears.
I have to delete all the commit logs to get the Cassandra service to start. The problem is that I lose a lot of data. I tried increasing the replication factor to 3, but it's the same.
What can I do to decrease the amount of lost data?
P.S.: I only have one PC to run the Cassandra database on; it is not possible to add more machines.
I think your best option here is to work around the issue, since it's unlikely there is a guaranteed way to prevent commit log files from getting corrupted by a sudden power outage. Since you only have a single node, it is more difficult to recover the data. Increasing the replication factor to 3 on a single-node cluster is not going to help.
One thing you can try is to flush the memtables more frequently (i.e. reduce the flush interval). When a memtable is flushed, its entries in the commit log are discarded, which reduces the amount of data that can be lost. This will, however, not resolve the root issue.
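One way to do that is the per-table flush period (the table name and the 60-second value below are just placeholders), set from cqlsh:
ALTER TABLE my_keyspace.my_table WITH memtable_flush_period_in_ms = 60000;
You could also run nodetool flush from a cron job to bound how much data exists only in the commit log.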
I do know that it's a Cassandra anti-pattern to delete rows (and even more so to do it frequently), but in my simple use case I have a local Cassandra instance (single node, replication factor set to 1) that I use for unit tests, which drop all tables before running, naturally to perform the tests with a clean slate.
Over time, the performance of this Cassandra instance degraded severely. It surprised me a bit that dropping the keyspaces altogether didn't help at all. Only by manually deleting everything in Cassandra's data directory did I manage to recover the performance.
This solution is quite fine for me, as I don't care about the test data I delete over and over again, but it certainly feels a bit weird to have to delete these things manually on the file system. Is there a better way to deal with such a situation? Or am I going about this whole thing completely wrong?
Based on the little information provided, here are a few pointers:
First, deleting data creates tombstones in Cassandra. The default behavior is to keep these tombstones for 10 days, controlled by the table property gc_grace_seconds.
Given you only have one node and don't care about the data once you delete it, you could set gc_grace_seconds to zero. You could also make sure to run a compaction after you do a lot of deletes, as in the example below.
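For example, per table in cqlsh (keyspace and table names are placeholders):
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 0;
and then a major compaction from the command line:
nodetool compact my_keyspace my_table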
Documentation here:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html
Lastly, there is a feature known as TTL (time to live). You could use that instead of deleting and let the database do the "deletes" once the data expires. If you go this route, I would still set gc_grace_seconds to zero and run compactions (via an hourly cron job, since it's a dev environment).
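For example (the table, columns, and the one-day TTL are placeholders):
INSERT INTO my_keyspace.events (id, payload) VALUES (uuid(), 'test') USING TTL 86400;
or set a default for the whole table:
ALTER TABLE my_keyspace.events WITH default_time_to_live = 86400;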
I have a single-node Cassandra installation on my development machine (and very little experience with Cassandra). I always had very little data in the node and experienced no problems. Today I inserted about 9,000 elements into a table to experiment with a real-world use case. Now when I start up the node, the boot time is extremely long and I get this in system.log:
Replaying /var/lib/cassandra/commitlog/CommitLog-3-1388134836280.log
...
Log replay complete, 9274 replayed mutations
That took 13 minutes and is hardly bearable. I wonder if there is a way to store the data such that it can be read directly, without replaying the log. After all, 9,000 elements are nothing, and there must be a quicker way to boot. I googled for hints and searched Cassandra's documentation, but I didn't find anything. It's obvious that I'm not looking for the right things; would anybody be so kind as to point me to the right documents? Thanks.
There are a few things that might help. The most obvious thing you can do is flush everything out of the commit log before you shut down Cassandra. This is a good idea in production too. Before I stop a Cassandra node in production I'll run the following commands:
nodetool disablethrift
nodetool disablegossip
nodetool drain
The first two commands gracefully shut down connections to clients connected to this node and then to other nodes in the ring. The drain command flushes memtables to disk (sstables). This should minimize what needs to be replayed on startup.
There are other factors that can make startup take a long time. Cassandra opens all the SSTables on disk at startup. So the more column families and SSTables you have on disk the longer it will take before a node is able to start serving clients. There was some work done in the 1.2 release to speed this up (so if you are not on 1.2 yet you should consider upgrading). Reducing the number of SSTables would probably improve your start time.
Since you mentioned this was a development machine I'll also give you my dev environment observations. On my development machine I do a lot of creating and dropping column families and key spaces. This can cause some of the system CFs to grow significantly and eventually cause a noticeable slowdown. The easiest way to handle this is to have a script that can quickly bootstrap a new database and blow away all the old data in /var/lib/cassandra.
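Something along these lines works as a rough sketch, assuming a package install with the default paths and a schema.cql file you maintain for the tests:
#!/bin/sh
# Wipe a single dev node and bring it back with a clean slate.
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* \
            /var/lib/cassandra/commitlog/* \
            /var/lib/cassandra/saved_caches/*
sudo service cassandra start
# Wait until the node accepts CQL connections.
until cqlsh -e 'DESCRIBE KEYSPACES' > /dev/null 2>&1; do sleep 1; done
# Re-create the schema the tests expect (hypothetical file).
cqlsh -f schema.cql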