Cassandra Counters WriteTimeoutException

Our system has 3-4 tables where we keep counters (the counter data type) for events fired from our applications. We use Kafka for queueing, and the application is built with Dropwizard.
The concerned part of the system looks like this:
[Ingestion Module] -> Kafka -> [Analytics Module] -> Cassandra
The data comes in at high volume, and the moment we increase the number of workers/consumers in the Analytics Module, we start getting the following exceptions:
! com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during COUNTER write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
! at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:88)
! at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:66)
! at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:297)
! at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:268)
! at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
! ... 25 common frames omitted
Cassandra setup:
Nodes: 5
Replication Factor: 2
version: 3.4
Query 1
Can someone please help us out with the possible causes/solutions for this problem? Or please point us in the right direction.
Query 2
I have one more question about the counter data type. Is an update on a counter thread safe, or can it lead to inconsistency if we try to update the same counter from multiple workers?

The counter type isn't a "reliable" counter: by its nature, you don't know whether a write happened or not. You can retry the operation, but that may lead to a double count; if you don't retry, you may lose data.
But if you need reliable counting, you can use another approach: write every count event as a separate row inside some partition (with the insert marked as idempotent, so it can be retried and will simply overwrite the same data), and then have a separate job that goes through all the rows and sums the individual counts, as sketched below.
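As a rough illustration of that approach, here is a minimal sketch with the DataStax Java driver 3.x; the keyspace, the count_events table, and the bucket/column names are hypothetical, not from the original answer. The key idea is that each event is written with a client-generated id, so a retried write overwrites the same row instead of double counting, and a reader sums the rows per partition.

import com.datastax.driver.core.*;
import java.util.UUID;

public class EventCountSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("analytics");   // hypothetical keyspace

        // Hypothetical table:
        // CREATE TABLE count_events (bucket text, event_id uuid, delta bigint,
        //                            PRIMARY KEY (bucket, event_id));

        // The id is generated once per event and reused on any retry, so a
        // retried write overwrites the same row instead of adding a second one.
        UUID eventId = UUID.randomUUID();
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO count_events (bucket, event_id, delta) VALUES (?, ?, ?)",
                "page_views", eventId, 1L);
        write.setIdempotent(true);                         // safe for the driver to retry
        write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(write);

        // A separate job sums the individual rows to get the counter value.
        Row row = session.execute(
                "SELECT sum(delta) AS total FROM count_events WHERE bucket = ?",
                "page_views").one();
        System.out.println("count = " + row.getLong("total"));

        cluster.close();
    }
}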

Related

Light-weight-transaction Performance in Cassandra

According to the Cassandra docs, LWT performance is poor. Consider the following scenarios.
Scenario 1:
Read from Cassandra (thread 1)
Write to Cassandra if the step above does not return any data (thread 2)
In this case, the steps are processed on different threads. Thread 1 reads data from Cassandra and passes it to thread 2, then thread 1 queries for new data again as soon as possible. This improves throughput, but there will be two connections to Cassandra.
Scenario 2:
Write to Cassandra using an LWT.
In this case, only one thread sends queries to Cassandra. If LWT performance is really bad, this will reduce the overall throughput.
I'm not sure which one is better. Is LWT performance really that bad?
Given that you can guarantee the race condition between threads will not occur for the same primary key, you should continue to use the read-then-write approach.
LWTs go through a four-phase process (prepare, read, propose, commit), which can make the operation take roughly four times longer than a single write, since it requires four round trips between the node acting as the coordinator/proposer and the replicas involved in the transaction.
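For illustration, a minimal sketch of the two approaches with the DataStax Java driver 3.x; the keyspace, users table, and bind values are hypothetical, not from the question.

import com.datastax.driver.core.*;

public class LwtVsReadThenWrite {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");   // hypothetical keyspace

        // Hypothetical table: CREATE TABLE users (id text PRIMARY KEY, name text);

        // Scenario 1: read, then write only if the row is absent. Cheap (one
        // round trip per statement), but only safe if the application guarantees
        // that no two threads race on the same primary key.
        Row existing = session.execute(
                "SELECT id FROM users WHERE id = ?", "user-42").one();
        if (existing == null) {
            session.execute("INSERT INTO users (id, name) VALUES (?, ?)",
                    "user-42", "Alice");
        }

        // Scenario 2: a single LWT. The IF NOT EXISTS condition goes through
        // Paxos (prepare, read, propose, commit), so it costs several round
        // trips but stays correct under concurrent writers.
        ResultSet rs = session.execute(new SimpleStatement(
                "INSERT INTO users (id, name) VALUES (?, ?) IF NOT EXISTS",
                "user-42", "Alice"));
        System.out.println("applied = " + rs.wasApplied());

        cluster.close();
    }
}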

Proper Consistency Level to read 'everything'

I'm creating a sync program to periodically copy our Cassandra data into another database. The database I'm copying from only gets INSERTs - data is never UPDATEd or DELETEd. I would like to address Cassandra's eventual consistency model in two ways:
1 - Each sync scan overlaps the last by a certain time span. For example, if the scan happens every hour, then each scan looks an hour and a half backwards. The data contains a unique key, so reading the same record in more than one scan is not an issue.
2 - I use a Consistency level of ALL to ensure that I'm scanning all of the nodes on the cluster for the data.
Is ALL the best consistency level for this situation? I just need to see a record on any node; I don't care whether it appears on the other nodes. But I don't want to miss any INSERTed records either, and I also don't want to hit timeouts or performance problems because Cassandra is waiting for multiple nodes to see that record.
To complicate this a bit more, this Cassandra network is made up of 6 clusters in different geographic locations, and I am only querying one. My assumption is that the overlap mentioned in #1 will eventually catch records that exist on the other clusters.
The query I'm doing is like this:
SELECT ... FROM transactions WHERE userid=:userid AND transactiondate>:(lastscan-overlap)
Where userid is the partitioning key and transactiondate is a clustering column. The list of userids is sourced elsewhere.
I use a Consistency level of All to ensure that I'm scanning all of the nodes on the cluster for the data
So consistency ALL has more to do with the number of data replicas read than it does with the number of nodes contacted. If you have a replication factor (RF) of 3 and query a single row at ALL, then Cassandra will hash your partition key to figure out the three nodes responsible for that row, contact all 3 nodes, and wait for all 3 to respond.
I just need to see a record on one node
So I think you'd be fine with LOCAL_ONE, in this regard.
The only possible advantage of using ALL, is that it actually does help to enforce data consistency by triggering a read repair 100% of the time. So if eventual consistency is a concern, that's a "plus." But *_ONE is definitely faster.
The CL documentation talks a lot about 'stale data', but I am interested in 'new data'
In your case, I don't see stale data as a possibility, so you should be OK there. The issue you would face instead is that, if one or more replicas failed during the write operation, a query at LOCAL_ONE may or may not hit the only replica that actually has the row. So your data wouldn't be stale vs. new; it'd be exists vs. does not exist. One point I make in the linked answer is that writing at a higher consistency level and reading at LOCAL_ONE might work for your use case.
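As a rough sketch of that write-high/read-low idea with the DataStax Java driver 3.x (reusing the transactions table from the question; the keyspace, column list, and bind values are placeholders I've assumed):

import com.datastax.driver.core.*;
import java.util.Date;

public class ConsistencySketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");   // hypothetical keyspace

        // Writers use a higher consistency level so every INSERT lands on a
        // majority of replicas in the local data center...
        SimpleStatement insert = new SimpleStatement(
                "INSERT INTO transactions (userid, transactiondate, amount) VALUES (?, ?, ?)",
                "user-1", new Date(), 42L);                 // hypothetical columns/values
        insert.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(insert);

        // ...so the sync scan can read cheaply at LOCAL_ONE and still see the row.
        // The 90-minute lower bound mirrors the hour-and-a-half overlap in #1.
        SimpleStatement scan = new SimpleStatement(
                "SELECT * FROM transactions WHERE userid = ? AND transactiondate > ?",
                "user-1", new Date(System.currentTimeMillis() - 90L * 60 * 1000));
        scan.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
        for (Row row : session.execute(scan)) {
            System.out.println(row);
        }

        cluster.close();
    }
}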
A few years ago, I wrote an answer about the different consistency levels, which you might find helpful in this case:
If lower consistency level is good then why we need to have a higher consistency(QUORUM,ALL) level in Cassandra?

Cassandra WriteTimeoutException during CAS write query

We have two CAS queries. Everything was working fine with 2 containers per region; after we increased the containers from 2 to 3, we started seeing the WriteTimeoutException. The traffic is the same or even lower than during regular business hours. Cassandra is deployed in 3 regions and each cluster has 3 hosts.
We're not sure what could cause these errors; the only change was increasing the application containers by one. Any help with debugging this further is appreciated.
UPDATE order_sequences USING TTL 10 SET instance_name = ? WHERE id_name = ? IF instance_name = null   (executed at ConsistencyLevel.QUORUM)
UPDATE order_sequences SET next_id = ? WHERE id_name = ? IF next_id = ? AND instance_name = ?   (executed at ConsistencyLevel.QUORUM)
Error stack:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (7 replica were required but only 0 acknowledged the write) at
com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:85) at
com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:23) at
com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35) at
com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:59) at
com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58) at
"CAS write" is a specialized metric that is recorded when a compare-and-set operation is performed. An LWT is also known as compare and set (CAS): replica data is compared, and any data found to be out of date is set to the most consistent value.
In Cassandra, the process combines the Paxos protocol with normal read and write operations to accomplish the compare and set operation.
The Paxos protocol is implemented as a series of phases:
• Prepare/Promise
• Read/Results
• Propose/Accept
• Commit/Acknowledge
These four phases require four round trips between the node proposing the lightweight transaction and any cluster replicas involved in the transaction, so performance is affected. Consequently, reserve lightweight transactions for situations where concurrency must be considered.
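The error above shows the condition being checked at SERIAL across a multi-region cluster. As a hedged sketch (not a confirmed fix from this thread), this is how the first UPDATE from the question could be issued with explicit consistency levels in the DataStax Java driver 3.x; LOCAL_SERIAL keeps the Paxos round trips inside the local data center, assuming cross-region linearizability is not required. The keyspace and bind values are hypothetical.

import com.datastax.driver.core.*;

public class CasStatementSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("orders");   // hypothetical keyspace

        SimpleStatement claim = new SimpleStatement(
                "UPDATE order_sequences USING TTL 10 SET instance_name = ? " +
                "WHERE id_name = ? IF instance_name = null",
                "worker-1", "order_id");                // hypothetical bind values

        // Consistency for the commit/learn phase of the write.
        claim.setConsistencyLevel(ConsistencyLevel.QUORUM);
        // Consistency for the Paxos prepare/propose phases. LOCAL_SERIAL limits
        // them to the local data center (assumption about the requirements).
        claim.setSerialConsistencyLevel(ConsistencyLevel.LOCAL_SERIAL);

        ResultSet rs = session.execute(claim);
        System.out.println("applied = " + rs.wasApplied());

        cluster.close();
    }
}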
For example, the following series of operations can fail, because mixing conditional (LWT) and unconditional operations on the same data uses different timestamping mechanisms and the ordering is not guaranteed:
DELETE ...
INSERT .... IF NOT EXISTS
SELECT ....
The following series of operations will work:
DELETE ... IF EXISTS
INSERT .... IF NOT EXISTS
SELECT .....
I would strongly recommend checking the "CAS write latency" statistics from the nodetool proxyhistograms command; it provides a histogram of network statistics at the time the command is run.
Could you please let me know if you are still facing this error?

Cassandra difference between ANY and ONE consistency levels

Assumptions: RF = 3
In some video on the Internet about consistency levels, the speaker says that CL = ONE is better than CL = ANY, because with CL = ANY the coordinator will be happy to store only a hint (and the data), assuming all the other nodes owning the corresponding partition key range are down, and we can potentially lose our data if the coordinator fails. But wait a minute... as I understand it, if we used CL = ONE and, for example, only one (of three) replica nodes for this partition key were available, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations, where all nodes for a particular token are gone. Then it's better to reject the write than to accept it with such a big risk of losing the coordinator.
CL=ANY should probably never be used on a production server. The write stays unreadable until the hint is replayed to a node owning that partition, because you can't read data while it only exists in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have the data stored in both a) the commit log and memtable on one node and b) the hints log. These are likely different nodes, but they could be the same about 1/3 of the time. So yes, with both CL=ONE and CL=ANY you risk complete loss of the data on a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is, hints are only kept for 3 hours by default (the max_hint_window_in_ms setting in cassandra.yaml); for anything longer than that you have to run repairs. You can repair as long as at least one copy of the data exists on some node in the cluster (hints stored on the coordinator don't count).
Consistency ONE guarantees that at least one replica node in the cluster has the write in its commit log, no matter what. With ANY, the write may in the worst case exist only as a hint on the coordinator (other nodes can't access it), and that hint is kept by default for 3 hours. If the other two replicas stay down past that window, you lose the data.
If you are worried about the risk, use QUORUM, and two nodes will have to acknowledge the write. It's up to the application developer/designer to decide. QUORUM will usually have slightly higher write latency than ONE, but you can always add more nodes should the load increase dramatically.
Also have a look at this nice tool to see what impacts do various consistencies and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF 3, three nodes in the cluster will actually receive the write. Consistency is just about how many of them you wait to hear back from: with ONE, you wait until one node has it in its commit log, but the coordinator still sends the write to all three, and if some of them don't respond the coordinator stores those writes as hints.
Most of the time, ANY in production is a bad idea.

how to rapidly increment counters in Cassandra w/o staleness

I have a Cassandra question. Do you know how Cassandra does updates/increments of counters?
I want to use a Storm bolt (CassandraCounterBatchingBolt from the storm-contrib repo on GitHub) which writes into Cassandra. However, I'm not sure how some of the implementation of the incrementCounterColumn() method works, and there are also the limitations with Cassandra counters (from: http://wiki.apache.org/cassandra/Counters) which make them useless for my scenario IMHO:
If a write fails unexpectedly (timeout or loss of connection to the coordinator node) the client will not know if the operation has been performed. A retry can result in an over count CASSANDRA-2495.
Counter removal is intrinsically limited. For instance, if you issue very quickly the sequence "increment, remove, increment" it is possible for the removal to be lost
Anyway, here is my scenario:
I update the same counter faster than the updates propagate to other Cassandra nodes.
Example:
Say I have 3 cassandra nodes. The counters on each of these nodes are 0.
Node1:0, node2:0, node3:0
An increment of 5 arrives: Node1:0, node2:0, node3:0
The increment starts at node 2 and still needs to propagate to node 1 and node 3:
Node1:0, node2:5, node3:0
In the meantime, another increment of 3 arrives before the previous increment has propagated. Assuming the 3 starts at a different node than the 5, we have:
Node1:3, node2:5, node3:0
Now, if the 3 gets propagated to the other nodes AS AN INCREMENT and not as a new value (and the same for the 5), then eventually all the nodes will read 8, which is what I want.
If the 3 overwrites the 5 (because it has a later timestamp), that is problematic and not what I want.
Do you know how these updates/increments are handled by Cassandra?
Note that a read before a write is still susceptible to the same problem, depending on which replica node the read executes against (even QUORUM can fail if propagation is not far enough along).
I'm also thinking that putting a cache between my Storm bolt and Cassandra might solve this issue, but that's a story for another time.
Counters in C* have a complex internal representation that avoids most (but not all) problems of counting things in a leaderless distributed system. I like to think of them as sharded counters. A counter consists of a number of sub-counters identified by host ID and a version number. The host that receives the counter operation increments only its own sub-counter, and also increments the version. It then replicates its whole counter state to the other replicas, which merge it with their states. When the counter is read the node handling the read operation determines the counter value by summing up the total of the counts from each host.
On each node a counter increment is just like everything else in Cassandra, just a write. The increment is written to the memtable, and the local value is determined at read time by merging all of the increments from the memtable and all SSTables.
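For concreteness, at the CQL level a counter increment is just an UPDATE with a relative delta; the client never sends an absolute value. A small sketch with the DataStax Java driver 3.x (the keyspace, page_hits table, and values are hypothetical):

import com.datastax.driver.core.*;

public class CounterSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("metrics");   // hypothetical keyspace

        // Hypothetical table:
        // CREATE TABLE page_hits (page text PRIMARY KEY, hits counter);

        // Each statement sends only a delta; the receiving replica bumps its own
        // shard of the counter, so concurrent increments do not overwrite each other.
        session.execute("UPDATE page_hits SET hits = hits + ? WHERE page = ?",
                5L, "/home");
        session.execute("UPDATE page_hits SET hits = hits + ? WHERE page = ?",
                3L, "/home");

        // A read sums the per-replica shards into a single value (8 here, barring
        // replication lag at low consistency levels).
        Row row = session.execute(
                "SELECT hits FROM page_hits WHERE page = ?", "/home").one();
        System.out.println("hits = " + row.getLong("hits"));

        cluster.close();
    }
}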
I hope that explanation helps you believe me when I say that you don't have to worry about incrementing counters faster than Cassandra can handle. Since each node keeps its own sub-counter and never replicates increment operations, there is no possibility of counts getting lost through the kind of race conditions a read-modify-write scheme would introduce. If Cassandra accepts the write, you're pretty much guaranteed that it will count.
What you're not guaranteed, though, is that the count will appear correct at all times. If an increment is written to one node but the counter value is read from another just after, there is no guarantee that the increment has been replicated, and you also have to consider what happens during a network partition. This is more or less the same as any write in Cassandra; it's part of its eventually consistent nature, and it depends on which consistency levels you used for the operations.
There is also the possibility of a lost acknowledgement. If you do an increment and lose the connection to Cassandra before you get the response back, you can't know whether or not your write got through. And when you get the connection back you can't tell either, since you don't know what the count was before you incremented. This is an inherent problem with systems that choose availability over consistency, and the price you pay for many of the other benefits.
Finally, the issue of rapid remove/increment/remove sequences is real and something you should avoid. The problem is that the increment operation will essentially resurrect the column, and if these operations come close enough together they might get the same timestamp. Cassandra is strictly last-write-wins and determines "last" by the timestamp of the operation. If two operations have the same timestamp, the "greater" one wins, meaning the one that sorts after in a strict byte order. It's real, but I wouldn't worry too much about it unless you're doing very rapid writes and deletes to the same value (which is probably a fault in your data model).
Here's a good guide to the internals of Cassandra's counters: http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf
The current version of counters is just not a good fit for a use case that requires a guarantee of no over-counting and immediate consistency.
There are increment and decrement operations, and those will not collide with each other, and, barring any lost mutations or replayed mutations, will give you a correct result.
The rewrite of Cassandra counters (https://issues.apache.org/jira/browse/CASSANDRA-6504) might be interesting to you, and it should address all of the current concerns with getting a correct count.
In the meantime, if I had to implement this on top of a current version of Cassandra and an accurate count was essential, I would probably store each increment or decrement as a separate column and do read-time aggregation of the results, while writing back a checkpoint so you don't have to read back to the beginning of time to calculate subsequent results (see the sketch below).
That adds a lot of burden to the read side, though it is extremely efficient on the write path, so it may or may not work for your use case.
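A minimal sketch of that checkpointed read-time aggregation, again with the DataStax Java driver 3.x; the deltas and checkpoint tables and all names are hypothetical, invented for illustration.

import com.datastax.driver.core.*;
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class CheckpointedCountSketch {
    // Hypothetical tables:
    // CREATE TABLE deltas     (name text, id timeuuid, delta bigint,
    //                          PRIMARY KEY (name, id));
    // CREATE TABLE checkpoint (name text PRIMARY KEY, upto timeuuid, total bigint);

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("metrics");   // hypothetical keyspace

        // Writers append increments/decrements as individual rows (cheap writes).
        session.execute("INSERT INTO deltas (name, id, delta) VALUES (?, ?, ?)",
                "orders", UUIDs.timeBased(), 1L);

        // A reader starts from the last checkpoint...
        Row cp = session.execute(
                "SELECT upto, total FROM checkpoint WHERE name = ?", "orders").one();
        long total = (cp == null) ? 0L : cp.getLong("total");
        UUID upto = (cp == null) ? null : cp.getUUID("upto");

        // ...sums every delta recorded after it...
        ResultSet deltas = (upto == null)
                ? session.execute("SELECT id, delta FROM deltas WHERE name = ?", "orders")
                : session.execute("SELECT id, delta FROM deltas WHERE name = ? AND id > ?",
                        "orders", upto);
        UUID newest = upto;
        for (Row r : deltas) {
            total += r.getLong("delta");
            newest = r.getUUID("id");
        }
        System.out.println("count = " + total);

        // ...and writes the running total back so the next read starts from here.
        if (newest != null) {
            session.execute("INSERT INTO checkpoint (name, upto, total) VALUES (?, ?, ?)",
                    "orders", newest, total);
        }
        cluster.close();
    }
}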
To understand how updates/increments, i.e. write operations, are handled, I suggest you read about Gossip, the protocol Cassandra uses for internal communication. In Gossip, every participant (node) maintains its state using the tuple σ(K) = (V, N), where σ(K) is the state of key K with value V and version number N.
To maintain a single version of truth for a piece of data, Gossip uses a reconciliation mechanism, either Precise or Scuttlebutt (the current one). In Scuttlebutt reconciliation, before updating any tuple the nodes communicate with each other to check who holds the highest version (newest value) of the key; whoever holds the highest version is responsible for the write operation.
For further information read this article.
