Using Cassandra as a Queue

Using Cassandra as a Queue:
Is it really that bad?
Setup: 5-node cluster, all operations execute at QUORUM.
Using DateTieredCompactionStrategy should significantly reduce the cost of tombstones and allow entire SSTables to be dropped at once.
We add all messages to the queue with the same TTL.
We partition messages based on time (say, 1-minute intervals) and keep track of the read position.
Messages consumed will be explicitly deleted (only one thread extracts messages).
Some messages may be explicitly deleted prior to being read, i.e. we may have tombstones after the read position, and the TTL initially used is an upper limit. gc_grace would probably be set to 0, as QUORUM reads do blocking read repair (we can leave scheduled repair turned off, since messages only reside in one cluster (DC) and all operations run at QUORUM).
Messages can be added and deleted only; no updates allowed.
In our use case, if a tombstone does not replicate it's not a big deal; it's OK for us to see the same message multiple times occasionally. (Also, we would likely not run repair on a regular basis, as all operations execute at QUORUM.)
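A minimal sketch of the kind of schema being described, using the Python driver (keyspace, table, bucket granularity, TTL, and replication factor are illustrative assumptions; DateTieredCompactionStrategy only exists up to Cassandra 3.x):

    from datetime import datetime, timezone
    import uuid

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

    # All operations at QUORUM, per the setup above.
    profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
    cluster = Cluster(["127.0.0.1"],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # Time-bucketed queue table: DTCS, gc_grace_seconds = 0, and a table-level
    # TTL acting as an upper limit, as described above. Values are illustrative.
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.queue (
            bucket  text,      -- 1-minute bucket, e.g. 2016-01-01 12:34
            id      timeuuid,
            payload blob,
            PRIMARY KEY (bucket, id)
        ) WITH compaction = {'class': 'DateTieredCompactionStrategy'}
          AND gc_grace_seconds = 0
          AND default_time_to_live = 3600
    """)

    # Producer: append a message to the current bucket.
    bucket = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    session.execute(
        "INSERT INTO demo.queue (bucket, id, payload) VALUES (%s, %s, %s)",
        (bucket, uuid.uuid1(), b"message body"))

    # Single consumer thread: read from the current bucket, then delete.
    row = session.execute(
        "SELECT bucket, id, payload FROM demo.queue WHERE bucket = %s LIMIT 1",
        (bucket,)).one()
    if row:
        session.execute("DELETE FROM demo.queue WHERE bucket = %s AND id = %s",
                        (row.bucket, row.id))

The read position mentioned above would track the last consumed (bucket, id) so later polls can resume from it.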
Thoughts?

Generally, it is an anti-pattern; this link covers the impact of tombstones in detail: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
My opinion is: try to avoid it if possible, but if you really understand the performance impact and it is not an issue in your architecture, of course you could do it.
Another reason to avoid it if possible is that Cassandra's data structures are not designed for queues, so it will always look ugly. UGLY!
I strongly suggest considering Redis or RabbitMQ before making your final decision.

Related

Cassandra repairs on TWCS

We have a 13-node Cassandra cluster (version 3.10) with RF 2 and read/write consistency of 1.
This means the cluster isn't fully consistent, only eventually consistent. We chose this setup to speed up performance, and we can tolerate a few seconds of inconsistency.
The tables use TWCS with read repair disabled, and we don't run full repairs on them.
However, we've discovered that some entries are replicated only once, not twice, which means that when the node without the data is queried, the data isn't returned.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create overlapping tombstones, therefore they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
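For reference, a table configured the way the question describes might look roughly like this (a sketch against Cassandra 3.x; the keyspace, table, window size, and TTL are assumptions, and the read_repair_chance options were removed in Cassandra 4.0):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    # TWCS with background read repair disabled, as in the question.
    # The keyspace "demo" is assumed to exist.
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.events (
            sensor_id text,
            ts        timestamp,
            value     double,
            PRIMARY KEY (sensor_id, ts)
        ) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                             'compaction_window_unit': 'DAYS',
                             'compaction_window_size': 1}
          AND default_time_to_live = 604800
          AND read_repair_chance = 0.0
          AND dclocal_read_repair_chance = 0.0
    """)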
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the nodes it belongs on. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons, I'm sure, some of which include: networking issues, hardware overload (I/O, CPU, etc., which can cause dropped mutations), Cassandra/DSE being unavailable for whatever reason, etc.
If none of your nodes have been offline for at least a few hours (whether it be DSE or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show counts since the last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically it's some sort of load pressure causing it.
I have scheduled 1-second vmstat output that spools to a log continuously, with log rotation, so I can go back and check a few things if our nodes start misbehaving. It could help.
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level 1 can sometimes create problems for many reasons: if data is not replicating to the cluster properly due to dropped mutations, cluster/node overload, high CPU, high I/O, or network problems, you can end up with data inconsistency. Read repair handles this problem some of the time, if it is enabled. You can go with manual repair to ensure consistency of the cluster, but in your case you may also get some zombie data back.
I think, to avoid this kind of issue, you should consider a write CL of at least QUORUM, or you should run manual repair within gc_grace_seconds (the default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra runs repair in the background for chunks of data. For more details you can refer to the links below:
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html
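For example, with the Python driver the write consistency can be raised per statement, as suggested above (a sketch; the contact point, keyspace, and demo events table are assumptions):

    from datetime import datetime, timezone

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster

    # Hypothetical contact point and keyspace.
    session = Cluster(["10.0.0.1"]).connect("demo")

    # With RF 2, QUORUM means both replicas must acknowledge each write,
    # so a dropped mutation surfaces as an error instead of silent divergence.
    insert = session.prepare(
        "INSERT INTO events (sensor_id, ts, value) VALUES (?, ?, ?)")
    insert.consistency_level = ConsistencyLevel.QUORUM

    session.execute(insert, ("sensor-1", datetime.now(timezone.utc), 42.0))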

Why do tables get out of sync over time when Write Consistency ALL is used?

I am running a Cassandra 3.11.4 cluster with 1 data center, 2 racks, and 11 nodes. My keyspaces and tables are set to replication factor 2. I use the Prometheus-Grafana combo to monitor the cluster.
Observation: during (massive) inserts using write consistency level ALL (i.e. 2 nodes), the affected tables/nodes slowly get out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I used ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: if I dare to use write consistency ANY, I get exactly that - and even though all nodes are online, Cassandra does not even seem to attempt to write to all nodes. In either case (ANY or ALL) I have to perform incremental repairs.
First of all, your expectation is correct: writes, regardless of the consistency level (ALL or ONE or ANY or whatever), do make every attempt to reach all replicas. The different write consistency levels only differ in when "success" is reported to the client: ALL waits until all writes are done, while ONE waits for just one (and does the others in the background). So unless one of your nodes goes down or is severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, other nodes save the writes it missed and replay them later).
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that has already gone through a repair process) and "unrepaired" data - new data that has not yet passed through repair. This does not mean that the new data is inconsistent or differs between nodes - it's just that nobody has checked yet! To mark this new data "repaired" you'd need to run an (incremental) repair - it will realize the data does not differ between nodes and mark it as "repaired".

If all replicas will sync up eventually, what's the point of read repairs?

If all replicas will sync up eventually, what's the point of read repairs?
Wouldn't there be cases where, if a write is being sent to all replicas and a read repair happens before the write completes, you'd essentially be duplicating that same write?
There are a few things here: blocking read repairs, async read repairs, and whether either is actually needed.
Blocking read repairs: Quorum reads have been monotonically consistent for a while now. If you read twice, you should get the same answer. People tend to use QUORUM reads because they want stronger consistency, so the blocking read repairs prevent reads from going back in time. If this behavior ended, it would cause potential surprises for existing applications. However, the latency impact of these repairs has been causing issues, and it may still be changed in the very near future. You cannot currently disable this behavior; it is always on.
Async read repairs: repairs in the background can be disabled, happen only a small percentage of the time, or (recommended) happen only within a DC. This reduces the background impact, as there isn't as much cross-DC traffic. It is controlled by the DC-local and global read repair chance table settings. When you do a ONE or LOCAL_ONE etc. query, it will, depending on that chance, wait for the rest of the responses and compare the results for a read repair.
Statistically, you're far more likely to be doing unnecessary work with async read repairs, as on a normally functioning, perfect system they are not needed. Hinted handoff, however, is not perfect, and there are cases where hints are lost. In those situations consistency will not be met until an anti-entropy repair is run (which should be weekly or even daily, depending on how repairs are run - incremental or full, etc.).
So other than for the sake of monotonic consistency (blocking on QUORUM+ requests), read repairs are not really critical or needed. They are something people have used to statistically put the cluster in a more consistent state faster (maybe). Lots of people run without async read repairs (you cannot currently disable the blocking read repair mechanism, fwiw), and there is even serious talk of removing the options around it completely due to misunderstandings.
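A sketch of the per-table knobs referred to above (Cassandra 3.x property names on a hypothetical table; these options were removed in 4.0):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    # Async read repair for 10% of reads within the local DC only;
    # the cross-DC (global) read repair chance is turned off.
    session.execute("""
        ALTER TABLE demo.events
        WITH dclocal_read_repair_chance = 0.1
         AND read_repair_chance = 0.0
    """)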
One scenario that makes sense to me is this:
You write the data to a node (or a subset of the cluster)
You read the data (with Quorum), and one of the nodes has the fresher data.
Because you specified QUORUM, several nodes are asked for the value before the response is sent to the client. But because the data is fresher on one of the nodes, a blocking read repair must first happen to update all of them.
In this case, the read repair needs to happen because the "eventual update" has not happened yet.
In highly dynamic applications with many nodes, there are times when an eventually consistent write doesn't make it to a node PRIOR to a read request for that piece of data on that node. This is common in environments with heavy load on an undersized cluster, unknown hardware issues, and other reasons. It is also likely where write consistency is set to ONE while read consistency is set to LOCAL_QUORUM.
Example 1: random and sporadic network drops, due to an unknown failing network switch, that affect the write to the node but don't affect the read.
Example 2: the write occurs during a peak-load period and, as a result, doesn't make it to the overloaded node prior to the read request.

Cassandra difference between ANY and ONE consistency levels

Assumptions: RF = 3
In some video on the Internet about consistency levels, the speaker says that CL = ONE is better than CL = ANY, because with CL = ANY the coordinator will be happy to store only a hint (and the data) (we are assuming here that all the other nodes with the corresponding partition key ranges are down), and we can potentially lose our data due to the coordinator's failure. But wait a minute... as I understand it, if we used CL = ONE and, for example, had only one (of three) available node for this partition key, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations - all nodes for a particular token are gone. Then it's better to reject the write operation than to accept it with such a big risk of losing it if the coordinator fails.
CL=ANY should probably never be used on a production server. Writes will be unavailable until the hint is written to a node owning that partition, because you can't read data when it's in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have data stored in both a) the commit log and memtable on a node and b) the hints log. These are likely different nodes, but they could be the same 1/3 of the time. So, yes, with CL=ONE and CL=ANY you risk complete loss of data with a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
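A sketch of how the level is picked per request with the Python driver (the contact point, keyspace, and users table are hypothetical):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Hypothetical table: users(id int PRIMARY KEY, name text) in keyspace "demo".
    session = Cluster(["127.0.0.1"]).connect("demo")

    write = "INSERT INTO users (id, name) VALUES (%s, %s)"

    # ANY: succeeds even if only a hint on the coordinator holds the write.
    session.execute(SimpleStatement(write, consistency_level=ConsistencyLevel.ANY),
                    (1, "a"))

    # ONE: at least one replica must acknowledge the write.
    session.execute(SimpleStatement(write, consistency_level=ConsistencyLevel.ONE),
                    (2, "b"))

    # LOCAL_QUORUM: with RF = 3, two replicas in the local DC must acknowledge.
    session.execute(
        SimpleStatement(write, consistency_level=ConsistencyLevel.LOCAL_QUORUM),
        (3, "c"))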
The thing is, hints will only be stored for 3 hours by default, and for longer outages than that you have to run repairs. You can repair as long as at least one copy of the data exists on some node in the cluster (hints stored on the coordinator don't count).
Consistency ONE guarantees that at least one node in the cluster has it in its commit log no matter what. With ANY, in the worst case it is stored only in the coordinator's hints (other nodes can't access it), and hints are kept by default for a window of 3 hours. After those 3 hours pass, with ANY you are losing data if the other two instances are down.
If you are worried about the risk, then use QUORUM, and 2 nodes will have to guarantee to save the data. It's up to the application developer/designer to decide. QUORUM will usually have slightly higher write latency than ONE, but you can always add more nodes etc. should the load dramatically increase.
Also have a look at this nice tool to see what impact various consistency levels and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF 3, 3 nodes in the cluster will actually get the write. Consistency is just about how long you want to wait for responses from them... If you use ONE, you will wait until one node has it in its commit log. But the coordinator will actually send the write to all 3. If they don't respond, the coordinator will save the writes as hints.
Most of the time, ANY in production is a bad idea.

Cassandra for messaging

I am now doing some small research to find a way to store a huge volume of data (temporarily, until some consumers consume these messages) from various 'message producers' (sources). The data come from different sources - say HTTP, FTP, SMPP, or file upload - and each type may have tens or hundreds of instances creating messages. The messages produced can grow so numerous that the consumers lag behind, as processing each message may take a long (or at least not short) time. The system currently uses RabbitMQ in some parts, but its performance drops when a huge volume of unconsumed messages builds up (I'm also looking into improving that, but it's a separate issue). As an alternative, I am looking at Apache Kafka, which uses the disk for persisting messages.
As I read through many articles on the internet, I found some that talk about Apache Cassandra's very fast writes - processing a million inserts per second, with a similar read volume. I was astonished and tried to find some leads on using Cassandra for my case, but with no clear results.
Assuming I have a large number of message producers, can a Cassandra cluster handle inserts (in batches) fast enough (overall high throughput) that the producers are not throttled?
I am sure some of you have used Cassandra for this or a similar kind of use case - please share your experiences. (I am ready to provide more information if this does not suffice.)
Yes! Cassandra can handle writes very effectively. But in my experience, using it as a messaging system (a queue and the like) brings some technical constraints because of tombstones.
Cassandra doesn't remove deleted rows immediately; it marks them with a tombstone to be garbage-collected later. Over time, if there are a lot of deletions (e.g. dequeued messages), the overall performance will be hurt, and quite quickly.
You can go with Cassandra, but you will have to implement workarounds for the tombstone problem (time buckets, multiple status tables).
IMHO, Apache Kafka is much more appropriate for the messaging use case and can also be scaled massively.
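On the raw-throughput side of the question, the Python driver typically leans on many concurrent in-flight requests rather than large batches; a rough sketch (hosts, keyspace, and the messages table are assumptions):

    import uuid

    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    # Hypothetical table: messages(source text, id timeuuid, body blob,
    #                              PRIMARY KEY (source, id)) in keyspace "demo".
    session = Cluster(["127.0.0.1"]).connect("demo")

    insert = session.prepare(
        "INSERT INTO messages (source, id, body) VALUES (?, ?, ?)")

    # Fire 10,000 inserts with up to 200 in flight at a time.
    args = [("http", uuid.uuid1(), ("payload %d" % i).encode())
            for i in range(10000)]
    results = execute_concurrent_with_args(
        session, insert, args, concurrency=200, raise_on_first_error=False)

    failed = sum(1 for ok, _ in results if not ok)
    print("failed writes:", failed)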

Resources