We have a multi-datacenter Cassandra environment, with the consistency level set to LOCAL_QUORUM.
I want to know the latency between the local datacenter and the other datacenters. What I mean is: when a write succeeds, how long until the other datacenter has the replica?
This metric is not exposed by Cassandra.
I have found that write latency is collected in the org.apache.cassandra.service.StorageProxy.mutate method, and I want to add code there to collect per-datacenter latency.
The problem is that a Cassandra write finishes as soon as the consistency level's number of replicas have acknowledged, so I cannot block the write transaction. How do I keep the memtable write and the metrics write in sync?
I have no idea how to proceed. Does anybody have an idea on how to achieve this? Please take a look.
There isn't anything available directly at this time, though there is a ticket with a patch available at CASSANDRA-11569.
There are some tricks you can try in the meantime.
If you enable tracing on a query (with CL = ALL), you can check the trace events table to see when the mutation left the coordinator and when it arrived on each replica.
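A rough sketch of that trick with the DataStax Java driver 3.x (the contact point and ks.tbl are placeholders):

    import com.datastax.driver.core.*;

    public class TraceCrossDcWrite {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                Statement stmt = new SimpleStatement(
                        "INSERT INTO ks.tbl (id, val) VALUES (1, 'x')")
                    .setConsistencyLevel(ConsistencyLevel.ALL) // wait for every replica
                    .enableTracing();
                ResultSet rs = session.execute(stmt);
                // getQueryTrace() fetches the rows tracing wrote to system_traces.events
                QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
                for (QueryTrace.Event e : trace.getEvents()) {
                    // Compare the coordinator's "Sending ... message" events with the
                    // "Appending to commitlog" events logged by the remote-DC replicas.
                    System.out.printf("%s %s (+%d us)%n",
                            e.getSource(), e.getDescription(), e.getSourceElapsedMicros());
                }
            }
        }
    }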
You can make a LOCAL_QUORUM write, then an EACH_QUORUM write, and track the difference.
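Continuing the session from the sketch above, the CL-difference trick could look like this (single-shot timings are noisy, so repeat and take percentiles in practice):

    long t0 = System.nanoTime();
    session.execute(new SimpleStatement("INSERT INTO ks.tbl (id, val) VALUES (1, 'x')")
            .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));
    long localQuorum = System.nanoTime() - t0;

    t0 = System.nanoTime();
    session.execute(new SimpleStatement("INSERT INTO ks.tbl (id, val) VALUES (2, 'x')")
            .setConsistencyLevel(ConsistencyLevel.EACH_QUORUM));
    long eachQuorum = System.nanoTime() - t0;

    // Rough upper bound on the extra time the remote DC needed to ack a quorum.
    System.out.printf("approx cross-DC overhead: %.2f ms%n",
            (eachQuorum - localQuorum) / 1_000_000.0);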
There's a problem with some of these metrics when tracking mutations. Cassandra will piggyback all the writes in a DC over a single proxy write (instead of the coordinator sending to each node individually). If that proxy node hits a GC pause, latency is likely to spike. Speculative retry will help in the extreme case, but then you're not really tracking your raw cross-DC latency. You may want to just consider "ping".
I have a DataStax Cassandra cluster with 8 nodes. The keyspace used by the application contains about 400 tables. The parameter write_request_timeout_in_ms in cassandra.yaml is set to 2000 ms (the default).
The default value is high enough for most tables. However, for two tables I require a much higher write_request_timeout. I know that settings such as the bloom filter false-positive chance or the compaction strategy can be configured per table.
Is it possible to do the same for timeouts, and if so, how?
Regards
It isn't possible to configure different write timeouts per table, because all writes are persisted to the same commit log disk.
A coordinator returns a write timeout when not enough replicas (based on the write consistency level) acknowledged the write to the commit log, typically because the disk is busy.
Since there is only one commit log disk on each node, it makes no sense to have different write timeouts per table. This in fact raises another question: what problem are you trying to solve?
Increasing timeouts is almost never the right thing to do since all it does is hide the problem. You need to identify the root of the issue and fix it. Cheers!
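If the underlying need is just to let the client wait longer on those two tables, one possible client-side workaround is a per-statement driver timeout, which leaves the server's write_request_timeout_in_ms untouched. A sketch assuming the DataStax Java driver 3.x, with ks.big_table as a placeholder:

    // imports: com.datastax.driver.core.*
    // Raises only the driver's wait for this statement; the coordinator still
    // applies the cluster-wide write_request_timeout_in_ms.
    Statement slowWrite = new SimpleStatement(
            "INSERT INTO ks.big_table (id, val) VALUES (?, ?)", 1, "x")
        .setReadTimeoutMillis(10_000); // wait up to 10 s for this request only
    session.execute(slowWrite);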
We have a 13-node Cassandra cluster (version 3.10) with RF = 2 and a read/write consistency of ONE.
This means the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up performance, and we can tolerate a few seconds of inconsistency.
The tables use TWCS with read repair disabled, and we don't run full repairs on them.
However, we've discovered that some entries are replicated only once instead of twice, which means that when the out-of-date node is queried, it fails to return the data.
My first question is: how could this happen? Shouldn't Cassandra replicate all the data?
Now, if we choose to run repairs, it will create overlapping tombstones, so they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property that ignores the overlap, but it feels like a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster that it belongs to. There is no guarantee of that, as you have found out. How could that happen? There are multiple possible reasons, including: networking issues, hardware overload (I/O, CPU, etc., which can cause dropped mutations), or Cassandra/DSE being unavailable for whatever reason.
If none of your nodes have been "off-line" for at least a few hours (whether it be DSE or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: grep -i mutation /var/log/cassandra/system.log | grep -i drop (and the same for debug.log)
I'm guessing you're probably dropping mutations, and the Cassandra logs and tpstats will record this (tpstats only shows counts since the last Cassandra/DSE restart). If you are dropping mutations, you'll have to work out why; typically some sort of load pressure is causing it.
I have a scheduled 1-second vmstat output that spools to a log continuously, with log rotation, so I can go back and check a few things if our nodes start misbehaving. It could help.
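A minimal version of that setup (the log path is arbitrary; pair it with logrotate in practice):

    # 1-second vmstat samples, timestamped so spikes can be matched to log events
    vmstat 1 | while read -r line; do
        echo "$(date '+%F %T') $line"
    done >> /var/log/vmstat.log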
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
Consistency level ONE can cause problems for several reasons: if data doesn't replicate across the cluster properly due to dropped mutations, cluster/node overload, high CPU, high I/O, or network problems, you end up with inconsistent data. Read repair handles this some of the time, if it is enabled. You can run a manual repair to ensure the consistency of the cluster, but in your case that can also resurrect some zombie data.
I think, to avoid this kind of issue, you should consider a consistency level of at least QUORUM for writes, or run a manual repair within gc_grace_seconds (default is 10 days) for all the tables in the cluster.
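For example, a minimal sketch with the DataStax Java driver 3.x that makes QUORUM the default consistency for every request from the client (contact point is a placeholder):

    // imports: com.datastax.driver.core.*
    Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withQueryOptions(new QueryOptions()
                    .setConsistencyLevel(ConsistencyLevel.QUORUM))
            .build();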
Also, you can use incremental repair so that Cassandra repairs chunks of data in the background. For more details, see the links below:
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html
I am currently managing a Percona XtraDB cluster composed of 5 nodes that handles millions of inserts every day. Write performance is very good, but reading is not so fast, especially when I request a big dataset.
The records inserted are sensor time series.
I would like to try Apache Cassandra to replace the Percona cluster, but I don't understand how data reading works. I am looking for something able to split a query across all the nodes and read in parallel from more than one node.
I know that Cassandra sharding can have shard replicas.
If I have 5 nodes and set a replication factor of 5, will reads be 5x faster?
Cassandra read path
The read request initiated by a client is sent to a coordinator node, which asks the partitioner which replicas are responsible for the data and checks whether the consistency level can be met.
The coordinator checks whether it is itself responsible for the data. If yes, it satisfies the request; if not, it sends the request to the fastest-answering replica (determined using the dynamic snitch). A digest request is also sent to the other replicas.
The coordinator compares the returned data digests; if they all match and the consistency level has been met, the data from the fastest-answering replica is returned. If the digests differ, the coordinator issues read-repair operations.
On each node a few steps are performed: check the row cache, check the memtables, check the SSTables. More information: How is data read? and ReadPathForUsers.
Load balancing queries
Since your replication factor equals the number of nodes, each node holds all of your data. So when a coordinator node receives a read query, it can satisfy it from itself (in particular, with a LOCAL_ONE consistency level the request will be pretty fast).
The client drivers implement the load balancing policies, which means you configure on the client how queries are spread around the cluster. Some more reading: ClientRequestsRead.
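For instance, with the DataStax Java driver 3.x you might combine token-aware routing with a DC-aware policy ("dc1" and the contact point are placeholders):

    // imports: com.datastax.driver.core.*, com.datastax.driver.core.policies.*
    // Token awareness routes each query to a replica that owns the partition,
    // so most requests skip the extra coordinator hop.
    Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withLoadBalancingPolicy(new TokenAwarePolicy(
                    DCAwareRoundRobinPolicy.builder().withLocalDc("dc1").build()))
            .build();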
If I have 5 nodes and set a replication factor of 5, will reads be 5x faster?
No. It means you will have up to 5 copies of the data, ensuring your query can still be satisfied when nodes are down. Cassandra does not divide up the work of a single read across replicas. Instead, it pushes you to design your data in a way that makes reads efficient and fast.
The best way to read from Cassandra is to make sure every query you issue hits a single partition, which means the first column of a simple PRIMARY KEY (x, y, z), or the first parenthesized group of a compound PRIMARY KEY ((x, y), z), is provided as a query parameter.
This goes back to the Cassandra design principle of modeling tables around your query needs.
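A sketch for the sensor time series from the question (table and column names are illustrative; DataStax Java driver 3.x):

    // imports: com.datastax.driver.core.*
    // Partition key = (sensor_id, day), so one day of one sensor is one partition.
    session.execute("CREATE TABLE IF NOT EXISTS sensors.readings ("
            + " sensor_id text, day date, ts timestamp, value double,"
            + " PRIMARY KEY ((sensor_id, day), ts))"
            + " WITH CLUSTERING ORDER BY (ts DESC)");

    // Both partition-key columns are bound, so this read is served from a
    // single partition on the replicas that own it.
    ResultSet rs = session.execute(new SimpleStatement(
            "SELECT ts, value FROM sensors.readings"
            + " WHERE sensor_id = ? AND day = ? AND ts >= ?",
            "sensor-42",
            LocalDate.fromMillisSinceEpoch(System.currentTimeMillis()),
            new java.util.Date(System.currentTimeMillis() - 3_600_000L)));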
Replication is about copies of data; partitioning is about distributing data.
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archPartitionerAbout.html
Some references on Cassandra data modeling:
https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key
https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
It is recommended to keep partitions around 100 MB or smaller, but this is not compulsory.
You can use the cassandra-stress utility to get a report of how your reads and writes perform.
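For example (standard cassandra-stress invocations; tune the counts and thread counts to your hardware):

    cassandra-stress write n=1000000 -rate threads=50
    cassandra-stress read n=1000000 -rate threads=50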
I'm a newbie to Cassandra and have a question about the commit log, which is configured to use periodic mode (10 seconds).
Suppose we have a node that processes a request with CL = ONE and RF = 3. If the node crashes while the commit log has not yet been flushed to disk and replication of the data is still pending, would we lose data?
A follow-up question: which node is responsible for replicating the data to the other nodes for RF = 3? Is it the coordinator node, or some other node that processes the request, depending on the consistency level?
I think the following link might be of use to you:
https://www.ecyrd.com/cassandracalculator/
Yes, data loss is possible in this scenario, because the data would not have reached the other nodes, so no copies exist; it is as if the data was never there. The thing is, this window is actually quite small, because with RF 3 the other nodes will receive the insert within milliseconds (unless there is some really heavy load on the node).
All of the replica writes for a single client request are handled by the coordinator. If a replica is not available when the coordinator needs to replicate to it, the coordinator stores the data as a hint.
So to sum up: yes, data loss is possible, but the probability is really small.
With CL=ONE, when a coordinator crashes and goes down uncleanly, there is a window where data loss is possible before the mutation is sent to replicas and the commit log is flushed. It's a pretty small window and unlikely, but if it's a concern, use LOCAL_QUORUM or the batch commit log mode.
The coordinator will send the data to all replicas and store hints for whichever replicas haven't acknowledged.
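For reference, the two commit log modes mentioned are set in cassandra.yaml (the values shown are the defaults):

    # cassandra.yaml
    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 10000   # acked writes may sit unflushed for up to 10 s

    # batch mode instead acks a write only after the commit log has been fsynced:
    # commitlog_sync: batch
    # commitlog_sync_batch_window_in_ms: 2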
Assumptions: RF = 3
In some video on the Internet about consistency levels, the speaker says that CL = ONE is better than CL = ANY, because with CL = ANY the coordinator will be happy to store only a hint (and the data), assuming all the other nodes owning the corresponding partition key range are down, and we can potentially lose our data if the coordinator fails. But wait a minute: as I understand it, if we used CL = ONE and, for example, only one of the three replicas for this partition key were available, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations, where all nodes for a particular token are gone. In that case it seems better to reject the write than to accept it with such a big risk of losing it along with the coordinator.
CL=ANY should probably never be used on a production cluster. Writes will be unreadable until the hint is replayed to a node owning that partition, because you can't read data while it's sitting in a hint log.
Using CL=ONE and RF=3 with two nodes down, you would have the data stored in both (a) the commit log and memtable on one node and (b) the hint log on the coordinator. These are likely different nodes, but they could be the same about 1/3 of the time. So yes, with both CL=ONE and CL=ANY you risk complete loss of data from a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is, hints are only stored for 3 hours by default; for longer than that you have to run repairs. You can repair as long as at least one node somewhere in the cluster has a copy of the data (hints stored on the coordinator don't count).
Consistency ONE guarantees that at least one node in the cluster has the write in its commit log no matter what. With ANY, the write may in the worst case exist only in the coordinator's hints (other nodes can't access it), and those are kept for 3 hours by default. If 3 hours pass while the other two replicas are down, with ANY you are losing data.
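That 3-hour window is the hinted handoff window in cassandra.yaml:

    # cassandra.yaml: how long a coordinator accumulates hints for an unreachable replica
    max_hint_window_in_ms: 10800000   # 3 hours (default)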
If you are worried about the risk, use QUORUM, and two nodes will have to guarantee to save the data. It's up to the application developer/designer to decide. QUORUM will usually have slightly higher write latency than ONE, but you can always add more nodes should the load dramatically increase.
Also have a look at this nice tool to see what impact various consistency levels and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF 3, all 3 replicas actually receive the write. The consistency level is just about how long you want to wait for responses from them. If you use ONE, you wait until one node has it in its commit log, but the coordinator still sends the write to all 3 replicas; if some don't respond, the coordinator saves those writes as hints.
Most of the time, ANY in production is a bad idea.