Is it safe to disable commit log? - cassandra

Is it safe to disable commit log if we use replication? When a node fails it is often due to hard disk failure so in that commit log would not help us for durability but replication would. Why do we even need a commit log when we use replication?

With no commit log, data stored in the memtables on the replicas may take a long time (could be unbounded, but in practice is often minutes) to be written to disk. This means, within that window, you could lose writes. If, for example, your data center loses power, you could lose all the writes for the last few minutes on all replicas. The commit log syncs (by default) every 10 seconds so you would lose at most 10 seconds of data in the event of simultaneous failure.
However, if you're using multi-data center replication then to lose data you would need simultaneous failures across data centers.
It's a trade off: commit log with no replication guards against a single node crashing or having a non-destructive failure. With replication in single DC, it guards against whole DC failure e.g. power failure. With replication in multi-DC, it guards against correlated failures. You can decide how much resilience you need based on the cost of enabling the commit log versus the cost of losing recent writes.

Related

Will the Write-Ahead-Log become the bottleneck of Cassandra?

In a Cassandra database, a write needs to be logged in the Write Ahead Log first and then added to the memtable in memory. Since the Write Ahead Log is on disk, although it performs sequential writes(i.e., append only), will it still be much slower than memory access, thus become the performance bottleneck for the writes?
If I understand it correctly, Cassandra supports the mechanism to store the Write Ahead Log in OS cache, and then flush it to disk every pre-configured amount of time(say 10 seconds). However, does it mean the data changes made within this 10 seconds could be all lost if the machine crashes?
You can control if the sync of commit log using the commitlog-sync configuration. By default it's periodic, and synced to disk every 10 seconds (controlled by commitlog_sync_period_in_ms setting).
And yes, if you lose the power there is a risk that data in the commit log is lost. But Cassandra relies on the fact that you have multiple replicas, and if you did setup correctly, each replica should be in separate rack (at least, better if you have additional data centers) with separate power, etc.

Write consistency = ALL: do I still need weekly repairs to avoid zombie records?

As far as I understood, the problem of deleted data reappearing in Cassandra is as follows:
A delete is issued with consistency < ALL (e.g. QUORUM)
The delete succeeds, but some nodes in the replication set were not reachable during the delete
A tombstone is written to all the reached nodes, nothing in the others
10 days pass, tombstone are eligible to be expired
Compactions happen, tombstones are actually removed
A read is issued: the nodes which received the delete reply with "no data"; the nodes which were unavailable during the delete reply with the old data; a zombie is produced
Now my question is: if the original delete was issued with consistency = ALL, all the nodes would either have the tombstone (before expiry&compaction) or no data at all (after expiry&compaction). No zombies should then be produced, even if we did not issue a repair before tombstone expiry.
Is this correct?
Yes you still need to run repairs even with CL.ALL on the delete if you want to guarantee no resurrected data. You just decrease likelihood of it occurring without you noticing it.
If a node is unavailable for the delete, the delete will fail for the client (because cl.all) but the other nodes all still received the delete. Even if your app will retry the delete theres a chance of it failing (ie your app's server hit by a meteor). So then you have a delete that has been seen by 2 of your 3 replicas. If you lowered your gc_grace and don't run repairs the other anti-entropy measures (hints, read repairs) may not ensure the tombstone (they are best effort not guarantee) was seen by the 3rd node before the tombstone is compacted away. The next read touches 3rd node which has the original data, and no tombstone exists to say it was deleted so you resurrect the data as its read repaired to other replicas.
What you can do is log a statement somewhere to point when there is a cl.all timeout or failure. This is not a guarantee since your app can die before the log, and a failure does not actually mean that the write did not get to all replicas - just that it may of failed to write. That said I would strongly recommend just using quorum (or local_quorum). That way you can have some host failures without losing availability since you need the repairs for the guarantee anyway.
When issuing queries with Consistency=ALL, every node having the token range of that particular record has to acknowledge. So if one of the NODE was down during this process, the DELETE will fail as it can't achieve the required consistency=ALL.
So consistency=ALL, might end up being a scenario where every node in the cluster has to stay up otherwise queries will fail. That's why people recommend to use lesser stronger consistency like QUORUM. So you are sacrificing high availability for REPAIRs if you want to perform queries at CONSISTENCY=ALL

Cassandra difference between ANY and ONE consistency levels

Assumptions: RF = 3
In some video on the Internet about Consistency level speaker says that CL = ONE is better then CL = ANY because when we use CL = ANY coordinator will be happy to store only hint(and data)(we are assuming here that all the other nodes with corresponding partition key ranges are down) and we can potentially lose our data due to coordinator's failure. But wait a minute.... as I understand it, if we used CL = ONE and for example we had only one(of three) available node for this partition key, we would have only one node with inserted data. Risk of loss is the same.
But I think we should assume equal situations - all nodes for particular token is gone. Then it's better to discard write operation then write with such a big risk of coordinator's loss.
CL=ANY should probably never be used on a production server. Writes will be unavailable until the hint is written to a node owning that partition because you can't read data when its in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have data stored in both a) the commit log and memtable on a node and b) the hints log. These are likely different nodes, but they could be the same 1/3 of the time. So, yes, with CL=ONE and CL=ANY you risk complete loss of data with a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is the hints will just be stored for 3 hours by default and for longer times than that you have to run repairs. You can repair if you have at least one copy of this data on one node somewhere in the cluster (hints that are stored on coordinator don't count).
Consistency One guarantees that at least one node in the cluster has it in commit log no matter what. Any is in worst case stored in hints of coordinator (other nodes can't access it) and this is stored by default in a time frame of 3 hours. After 3 hours pass by with ANY you are loosing data if other two instances are down.
If you are worried about the risk, then use quorum and 2 nodes will have to guarantee to save the data. It's up to application developer / designer to decide. Quorum will usually have slightly bigger latencies on write than One. But You can always add more nodes etc. should the load dramatically increase.
Also have a look at this nice tool to see what impacts do various consistencies and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF 3, 3 nodes in the cluster will actually get the write. Consistency is just about how long you want to wait for response from them ... If you use One, you will wait until one node has it in commit log. But the coordinator will actually send the writes to all 3. If they don't respond coordinator will save the writes into hints.
Most of the time any in production is a bad idea.

Using Cassandra as a Queue

Using Cassandra as Queue:
Is it really that bad?
Setup: 5 node cluster, all operations execute at quorum
Using DateTieredCompaction should significantly reduce the cost of TombStones, and allow entire SSTables to be dropped at once.
We add all messages to the queue with the same TTL
We partition messages based on time (say 1 minute intervals), and keep track of the read-position.
Messages consumed will be explicitly deleted. (only 1 thread extracts messages)
Some Messages may be explicitly deleted prior to being read (i.e. we may have tombstones after the read-position). (i.e. the TTL initially used is an upper limit) gc_grace would probably be set to 0, as quorum reads will do blocking-repair (i.e. we can have repair turned off, as messages only reside in 1 cluster (DC), and all operations a quorum))
Messages can be added/deleted only, no updates allowed.
In our use case, if a tombstone does not replicate its not a big deal, its ok for us to see the same message multiple times occasionally. (Also we would likely not run Repair on regular basis, as all operations are executing at quorum.)
Thoughts?
Generally, it is an anti-pattern, this link talks much of the impact on tombstone: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
My opinion is, try to avoid that if possible, but if you really understand the performance impact, and it is not an issue in your architecture, of course you could do that.
Another reason to not do that if possible is, the cassandra data structure is not designed for queues, it will always look ugly, UGLY!
Strongly suggest to consider Redis or RabbitMQ before making your final decision.

Cassandra is configured to lose 10 seconds of data by default?

As the data in the Commitlog is flushed to the disk periodically after every 10 seconds by default (controlled by commitlog_sync_period_in_ms), so if all replicas crash within 10 seconds, will I lose all that data? Does it mean that, theoretically, a Cassandra Cluster can lose data?
If a node crashed right before updating the commit log on disk, then yes, you could lose up to ten seconds of data.
If you keep multiple replicas, by using a replication factor higher than 1 or have multiple data centers, then much of the lost data would be on other nodes, and would be recovered on the crashed node when it was repaired.
Also the commit log may be written in less than ten seconds it the write volume is high enough to hit size limits before the ten seconds.
If you want more durability than this (at the cost of higher latency), then you can change the commitlog_sync setting from periodic to batch. In batch mode it uses the commitlog_sync_batch_window_in_ms setting to control how often batches of writes are written to disk. In batch mode the writes are not acked until written to disk.
The ten second default for periodic mode is designed for spinning disks, since they are so slow there is a performance hit if you block acks waiting for commit log writes. For this reason if you use batch mode, they recommend a dedicated disk for the commit log so that the write head doesn't need to do any seeks to keep the added latency as low as possible.
If you are using SSDs, then you can use more aggressive timing since the latency is greatly reduced compared to a spinning disk.
Cassandra's default configuration sets the commitlog_sync mode to periodic, causing the commit log to be synced every commitlog_sync_period_in_ms milliseconds, so you can potentially lose up to that much data if all replicas crash within that window of time.

Resources