In Cassandra cluster say I have twp nodes, Now clients send update for the same record(with different values) exactly at same time which goes to
two different nodes of Cassandra cluster. As Cassandra works in master less mode and both nodes can take the update request,
My question is how this conflict will be resolved during eventual consistency and which value will ultimately take precedence ?
Here is the example scenario
Initial data: KeyA: { colA:"val AA", colB:"val BB"}
Client 1 sends update: `update data set colA:"val C1_ColA" where
colB="val BB"` and data becomes below at node_1
KeyA: { colA:"val C1_ColA", colB:"val BB"}
Client 2 `update data set colA:"val C2_ColA" where
colB="val BB"` and data becomes becomes below at node_2
KeyA: { colA:"val C2_ColA", colB:"val BB"}
Now how the value of colA will eventually be resolved here ?
last write always wins, and I doubt that the timestamps will be the same - they are with microseconds resolution, so it's very low probability that timestamp will have the same value.
If you want to prevent this situation, then you can use lightweight transactions that allow to put condition on insert/updates/deletes, but you need to keep in mind that they are very resource intensive, and will add quite big load to the cluster.
Related
We have two CAS queries. It was working just fine with 2 containers per region. We have increased containers from 2 to 3 then we started seeing the WriteTimeoutException. The traffic is same or even less compared to the regular business hours. Cassandra is in 3 regions and each cluster has 3 hosts.
Not sure what could be the reason for these errors, but the change was increase in the application container by one. Appreciate if any help here to debug further.
UPDATE order_sequences USING TTL 10 set instance_name = ? where id_name = ? IF instance_name = null", ConsistencyLevel.QUORUM)
UPDATE order_sequences SET next_id= ? where id_name= ? IF next_id= ? AND instance_name = ?", ConsistencyLevel.QUORUM),
Error stack:
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (7 replica were required but only 0 acknowledged the write) at
com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:85) at
com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:23) at
com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35) at
com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:59) at
com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11) at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58) at
CAS write are a specialized metric which are triggered when a compare and set is conducted. LWT transaction is known as compare and set (CAS); replica data is compared and any data found to be out of date is set to the most consistent value.
In Cassandra, the process combines the Paxos protocol with normal read and write operations to accomplish the compare and set operation.
The Paxos protocol is implemented as a series of phases:
• Prepare/Promise
• Read/Results
• Propose/Accept
• Commit/Acknowledge
These four phases require four round trips between a node proposing a lightweight transaction and any cluster replicas involved in the transaction. The performance will be affected. Consequently, reserve lightweight transactions for situations where concurrency must be considered.
For example, the following series of operations can fail:
DELETE ...
INSERT .... IF NOT EXISTS
SELECT ....
The following series of operations will work:
DELETE ... IF EXISTS
INSERT .... IF NOT EXISTS
SELECT .....
Would strongly recommend you to check the "CAS write latency" statistics from
"nodetool proxyhistograms" command, it provides a histogram of network statistics at the time of the command.
Could you please let me know in case if you are still facing this error ?
I currently have a table set up in Cassandra that has either text, decimal or date type columns with a composite partition key of a business_date and an account_number. For queries to this table, I need to be able to support look-ups for a single account, or for a list of accounts, for a given date.
Example:
select x,y,z from my_table where business_date = '2019-04-10' and account_number IN ('AAA', 'BBB', 'CCC')
//Note: Both partition keys are provided for this query
I've been struggling to resolve performance issues related to accessing this data because I'm noticing latency patterns that I am having trouble trying to understand / explain.
In many scenarios, the same exact query can be run a total of three times in a short period by the client application. For these scenarios, I see that two out of three requests will have really bad response times (800 ms), and one of them will have a really fast one (50 ms). At first I thought this would be due to key or row caches, however, I'm not so sure since I believe that if this were true, the third request out of the three should always be the fastest, which isn't the case.
The second issue I believed I was facing was the actual data model itself. Although the queries are being submitted with all the partition keys being provided, since it's an IN clause, the results would be separate partitions and can be distributed across the cluster and so, this would be a bad access pattern. However, I see these latency problems when even single account queries are run. Additionally, I see queries that come with 15 - 20 accounts performing really well (under 50ms), so I'm not sure if the data model is actually an issue.
Cluster setup:
Datacenters: 2
Number of nodes per data center: 3
Keyspace Replication:local_dc = 2, remote_dc = 2
Java Driver set:
Load-balancing: DCAware with LatencyAware
Protocol: v3
Queries are still set up to use "IN" clauses instead of async individual queries
Read_consistency: LOCAL_ONE
Does anyone have any ideas / clues of what I should be focusing on in terms of really identifying the root cause of this issue?
the use of IN on the partition key is always the bad idea, even for composite partition keys. The value of partition key defines the location of your data in cluster, and different values of partition key will most probably put data onto different servers. In this case, coordinating node (that received the query) will need to contact nodes that hold the data, wait that these nodes will deliver results, and only after that, send you results back.
If you need to query several partition keys, then it will be faster if you issue individual queries asynchronously, and collect result on client side.
Also, please note that TokenAware policy works best when you use PreparedStatement - in this case, driver is able to extract value of partition key, and find what server holds data for it.
I have a development cassandra cluster of two cassandra nodes [Let's call them NodeA and NodeB]. I also have a script that is continuously sending data on NodeA. I have created the database with the following parameters:
CREATE KEYSPACE test_database WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Now, for some reason NodeB is stoping after some time. But the issue is, as soon as NodeB stops, the script that is sending data to NodeA starts giving data insertion error.
Can anyone point out a probable reason for the same.
Update: Both the nodes are seed nodes.
How Cassandra handle data repartition
Each key in cassandra can be converted to a token. When you install your cluster, the nodes calculate what range of token they will accept.
Let's take a simple example:
You have two nodes, and a token that goes from 0 to 9. A simple repartition would be: node A stores every token between 0-4 and node B stores every token between 5-9.
How Cassandra works for write
You choose a Coordinator (in your case node A), that receive the data. This node will then calculate a token. As seen in the first example, every node has a range of token assigned to it. So imagine the key is converted to token 4, then the data goes to node A (here the coordinator). If the token is 8, the data will be sent to node B.
What is cassandra data replication factor
The replication factor is how many time your data will be stored on your cluster. For a single database with no racks (your case), the data is first send to the node who owns the token associated with the key, and the replicas are sent to the next node in the topology.
In case of failure of one node, the replicas will help the node to restore its data.
In your case, there are no replicas, and if a node is down, Cassandra can't store the data and throws an error. If you have replication factor 2, Cassandra should be able to store a replica on node A and not fail.
Cassandra's Replication Factor:
Lets say we have 'n' as replication factor which means given input data will be stored/retrieved from 'n' nodes.
t
If you mention the replication factor as '1' which means only one node will have the data.
Partitioning:
Lets say we have 2 nodes, whenever you are inserting the data. Both these nodes will have some data, based on partitioning algorithm mentioned.
For example:
You are inserting 10 records, based on the hashing and partitioning algorithm, it chooses which node needs to be written for each record. Of-course the identification of node is done by the Coordinator :)
Durable Writes:
By default, cassandra always write in commit-log before flushing to disk. If you set to false, it will bypass commit-log and write directly to disk(SSTable).
The problem you have mentioned, for example lets say you are inserting 10 rows.
For simplicity, we can make the partitioning/hashing calculation as n/2.
So, Cassandra's Coordinator node splits up your data into two pieces(for simple calculation it will be 10/2) and tries to put 1st half in to 1st node and succeeds and tries to put the 2nd half into the second node(writing to commit-log), since it is unavailable it is throwing error.
So how do we fix this issue? lets say I want to batch insert multiple insert queries when 1 node in a cluster is down? It returns me
Connection to Cassandra cluster associated with connection cs1 not available due to Host not available. Host Address: cassandra1
If your table is not counter table , you can use consistency level of ANY which gives high availaiblity for write.
Refer this to learn more about it => https://www.datastax.com/blog/2011/05/understanding-hinted-handoff-cassandra-08
I know that Cassandra have different read consistency levels but I haven't seen a consistency level which allows as read data by key only from one node. I mean if we have a cluster with a replication factor of 3 then we will always ask all nodes when we read. Even if we choose a consistency level of one we will ask all nodes but wait for the first response from any node. That is why we will load not only one node when we read but 3 (4 with a coordinator node). I think we can't really improve a read performance even if we set a bigger replication factor.
Is it possible to read really only from a single node?
Are you using a Token-Aware Load Balancing Policy?
If you are, and you are querying with a consistency of LOCAL_ONE/ONE, a read query should only contact a single node.
Give the article Ideology and Testing of a Resilient Driver a read. In it, you'll notice that using the TokenAwarePolicy has this effect:
"For cases with a single datacenter, the TokenAwarePolicy chooses the primary replica to be the chosen coordinator in hopes of cutting down latency by avoiding the typical coordinator-replica hop."
So here's what happens. Let's say that I have a table for keeping track of Kerbalnauts, and I want to get all data for "Bill." I would use a query like this:
SELECT * FROM kerbalnauts WHERE name='Bill';
The driver hashes my partition key value (name) to the token of 4639906948852899531 (SELECT token(name) FROM kerbalnauts WHERE name='Bill'; returns that value). If I am working with a 6-node cluster, then my primary token ranges will look like this:
node start range end range
1) 9223372036854775808 to -9223372036854775808
2) -9223372036854775807 to -5534023222112865485
3) -5534023222112865484 to -1844674407370955162
4) -1844674407370955161 to 1844674407370955161
5) 1844674407370955162 to 5534023222112865484
6) 5534023222112865485 to 9223372036854775807
As node 5 is responsible for the token range containing the partition key "Bill," my query will be sent to node 5. As I am reading at a consistency of LOCAL_ONE, there will be no need for another node to be contacted, and the result will be returned to the client...having only hit a single node.
Note: Token ranges computed with:
python -c'print [str(((2**64 /5) * i) - 2**63) for i in range(6)]'
I mean if we have a cluster with a replication factor of 3 then we will always ask all nodes when we read
Wrong, with Consistency Level ONE the coordinator picks the fastest node (the one with lowest latency) to ask for data.
How does it know which replica is the fastest ? By keeping internal latency stats for each node.
With consistency level >= QUORUM, the coordinator will ask for data from the fastest node and also asks for digest from other replicas
From the client side, if you choose the appropriate load balancing strategy (e.g. TokenAwareStrategy) the client will always contact the primary replica when using consistency level ONE
In the introduction course of Cassandra DataStax they say that all of the clocks of a Cassandra cluster nodes, have to be synchronized, in order to prevent READ queries to 'old' data.
If one or more nodes are down they can not get updates, but as soon as they back up again - they would update and there is no problem...
So, why Cassandra cluster need synchronized clocks between nodes?
In general it is always a good idea to keep your server clocks in sync, but a primary reason why clock sync is needed between nodes is because Cassandra uses a concept called 'Last Write Wins' to resolve conflicts and determine which mutation represents the most correct up-to date state of data. This is explained in Why cassandra doesn't need vector clocks.
Whenever you 'mutate' (write or delete) column(s) in cassandra a timestamp is assigned by the coordinator handling your request. That timestamp is written with the column value in a cell.
When a read request occurs, cassandra builds your results finding the mutations for your query criteria and when it sees multiple cells representing the same column it will pick the one with the most recent timestamp (The read path is more involved than this but that is all you need to know in this context).
Things start to become problematic when your nodes' clocks become out of sync. As I mentioned, the coordinator node handling your request assigns the timestamp. If you do multiple mutations to the same column and different coordinators are assigned, you can create some situations where writes that happened in the past are returned instead of the most recent one.
Here is a basic scenario that describes that:
Assume we have a 2 node cluster with nodes A and B. Lets assume an initial state where A is at time t10 and B is at time t5.
User executes DELETE C FROM tbl WHERE key=5. Node A coordinates the request and it is assigned timestamp t10.
A second passes and a User executes UPDATE tbl SET C='data' where key=5. Node B coordinates the request and it is assigned timestamp t6.
User executes the query SELECT C from tbl where key=5. Because the DELETE from Step 1 has a more recent timestamp (t10 > t6), no results are returned.
Note that newer versions of the datastax drivers will start defaulting to use Client Timestamps to have your client application generate and assign timestamps to requests instead of relying on the C* nodes to assign them. datastax java-driver as of 3.0 now defaults to client timestamps (read more about there in 'Client-side generation'). This is very nice if all requests come from the same client, however if you have multiple applications writing to cassandra you now have to worry about keeping your client clocks in sync.