How clock synchronization affects the performance of YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I would like to examine how the accuracy of clock synchronization affects the performance of YugabyteDB. How do you usually test such things internally?

To answer your question: we use Jepsen tests to verify correctness; these tests introduce clock skew as part of a larger set of failures they inject.
This report is old, but it explains the point: https://blog.yugabyte.com/yugabyte-db-1-2-passes-jepsen-testing/ (see “Pushing the boundary on tolerating clock skew”)
YugabyteDB uses a combination of the monotonic clock and the Hybrid Logical Clock (HLC) to make it much more resilient to clock skew. Note that the --max_clock_skew_usec config flag controls the maximum clock skew allowed between any two nodes in the cluster and has a default value of 50ms. Kyle calls this out in his report (emphasis is mine in these quotes):
YugabyteDB uses Raft, which ensures linearizability for all (write) operations which go through Raft's log. For performance reasons, YugabyteDB reads return the local state from any Raft leader immediately, using leader leases to ensure safety. Using CLOCK_MONOTONIC for leases (instead of CLOCK_REALTIME) insulates YugabyteDB from some classes of clock error, such as leap seconds. Between shards, YugabyteDB uses a complex scheme involving Hybrid Logical Clocks (HLCs). Yugabyte couples those clocks to the Raft log, writing HLC timestamps to log entries, and using those timestamps to advance the HLC on new leaders. This technique eliminates several places where poorly synchronized clocks could allow consistency violations.
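For intuition, here is a minimal, toy sketch of the Hybrid Logical Clock update rule in Python. It is not YugabyteDB's implementation; it only illustrates how an HLC combines a physical clock with a logical counter so that timestamps never move backwards and always advance past timestamps received from other nodes.

```python
import time

class HybridLogicalClock:
    """Toy Hybrid Logical Clock: a (physical, logical) pair that never
    goes backwards, even if the wall clock stalls or jumps back."""

    def __init__(self):
        self.physical = 0   # highest physical time seen so far (microseconds)
        self.logical = 0    # counter to order events within the same physical tick

    def _wall_clock_us(self):
        return int(time.time() * 1_000_000)

    def now(self):
        """Timestamp a local or send event."""
        wall = self._wall_clock_us()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            # Wall clock stalled or jumped backwards: keep physical, bump logical.
            self.logical += 1
        return (self.physical, self.logical)

    def update(self, remote):
        """Merge a timestamp received from another node, so our next
        timestamp is guaranteed to be newer than anything seen so far."""
        wall = self._wall_clock_us()
        remote_physical, remote_logical = remote
        new_physical = max(self.physical, remote_physical, wall)
        if new_physical == wall and new_physical > max(self.physical, remote_physical):
            new_logical = 0
        elif new_physical == self.physical and new_physical == remote_physical:
            new_logical = max(self.logical, remote_logical) + 1
        elif new_physical == self.physical:
            new_logical = self.logical + 1
        else:
            new_logical = remote_logical + 1
        self.physical, self.logical = new_physical, new_logical
        return (self.physical, self.logical)
```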

Related

YugabyteDB deployment in 2 datacenters

[Question posted by a user on YugabyteDB Community Slack]
I have two different datacenters. I want the app to write simultaneously to the database in both datacenters. The instances in both the primary and secondary datacenters should be active, accept writes, and replicate synchronously or asynchronously. However, ACID properties should be maintained so that data is consistently read at both sites. The database in the primary should have all the data that the secondary has, and vice versa. The latency between the datacenters is 40ms.
Option 1: Use a single, multi-region YugabyteDB cluster stretched across the datacenters. YugabyteDB uses synchronous replication within a single cluster; this relies on a quorum (consensus) protocol.
For this deployment, because of the use of the Raft protocol, an odd number of datacenters, typically 3, is recommended so that you can tolerate a datacenter failure and still be active. Disadvantage: this deployment will generally have higher latency for write operations. The further apart the DCs are, the higher the latency will be.
At the tablet level, leaders have to coordinate the writes, no matter which node or DC the request comes from: the request first has to be routed to the leader of the tablet. This could add 40ms if the leader for the shard is in DC1 but your write request comes from an app running in DC2 or DC3. You will not have this penalty if you are primarily writing from one DC and you pick that DC as the preferred zone, so that all the leaders are kept there by default.
On top of that, the number of network round trips depends on whether the operation is a fast-path (single-shard) transaction, e.g. a simple single-row INSERT, in which case it adds about another 40ms, or a distributed transaction (e.g. a multi-row INSERT, or an INSERT into a table with one or more secondary indexes), which involves about two network round trips, so closer to 80-100ms.
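To make that arithmetic concrete, here is a rough back-of-the-envelope model in Python of the write path described above, using the 40ms inter-DC round trip from the question. The breakdown into a routing hop plus Raft round trips is a simplification; the numbers are illustrative, not measured.

```python
INTER_DC_RTT_MS = 40   # round-trip latency between datacenters, from the question
INTRA_DC_RTT_MS = 1    # assumed negligible latency inside a datacenter

def estimated_write_latency_ms(leader_is_local: bool, distributed_txn: bool) -> float:
    """Very rough model of a write in a stretched RF=3 cluster:
    one hop to reach the tablet leader (remote if it is in another DC),
    plus one round trip for Raft replication to a remote follower,
    plus a second round trip if the write is a distributed transaction."""
    routing = INTRA_DC_RTT_MS if leader_is_local else INTER_DC_RTT_MS
    raft_round_trips = 2 if distributed_txn else 1
    return routing + raft_round_trips * INTER_DC_RTT_MS

# App and leader in the same (preferred) DC, single-row fast-path INSERT:
print(estimated_write_latency_ms(leader_is_local=True, distributed_txn=False))   # ~41 ms
# App in DC2, leader in DC1, multi-shard transaction:
print(estimated_write_latency_ms(leader_is_local=False, distributed_txn=True))   # ~120 ms
```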
Option 2: Use two YugabyteDB clusters, one in DC1 and one in DC2. Each is RF=3 and asynchronously replicates to the other. Both clusters can take writes, and write latency stays within intra-DC latencies (so it will be much faster than Option 1). However, with async replication you will not have immediate consistency at both sites, so no full ACID guarantees across sites. Furthermore, if you are taking writes on both sites and replicating asynchronously in both directions, care has to be exercised. If the two clusters touch unrelated sets of keys/records, this is less of an issue. If they update the same records, the semantics are simply "latest writer wins", which is not ACID. In short, bidirectional async replication should be used carefully and has these caveats, as you can imagine, due to the very nature of async replication.
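To illustrate that last caveat, here is a minimal sketch of what "latest writer wins" looks like when both clusters update the same key; the row layout and timestamps are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Row:
    key: str
    value: str
    write_time_us: int   # timestamp assigned by the cluster that took the write

def resolve_conflict(local: Row, replicated: Row) -> Row:
    """Bidirectional async replication with last-writer-wins: the row with the
    higher write timestamp silently overwrites the other, losing that update."""
    return replicated if replicated.write_time_us > local.write_time_us else local

dc1_write = Row("account:42", "balance=100", write_time_us=1_700_000_000_000_100)
dc2_write = Row("account:42", "balance=250", write_time_us=1_700_000_000_000_250)

# When DC2's change is replicated into DC1, DC1's concurrent update is lost:
print(resolve_conflict(dc1_write, dc2_write))   # keeps balance=250; balance=100 is gone
```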

How can Cassandra be tuned to be CA in the CAP theorem?

I know the CAP theorem:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Cassandra is typically classified as an AP system. I heard it can be turned into CA, but I didn't find the documentation.
How do I use Cassandra as CA?
Thanks.
Generally speaking, the 'P' in CAP is what NoSQL technologies were built to solve for. This is usually accomplished by spreading data horizontally across multiple instances.
Therefore, if you wanted Cassandra to run in a "CA" CAP configuration, running it as a single node cluster would be a good first step.
I heard it can be turned into CA, but I didn't find the documentation.
After re-reading this, it's possible that you may have confused "CA" with "CP."
It is possible to run Cassandra as a "CP" database, or at least tune it to behave more in that regard. The way to go about this would be to set queries on the application side to use the higher consistency levels, like [LOCAL_]QUORUM, EACH_QUORUM, or even ALL. Consistency can be tuned even higher by increasing the replication factor (RF) in each keyspace definition. Setting the RF equal to the number of nodes and querying at ALL consistency would be about as high as it could be tuned to be consistent.
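As a sketch of what that looks like from the application side using the Python driver (the keyspace, table, and datacenter names below are hypothetical, and a 3-node cluster is assumed):

```python
# pip install cassandra-driver
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])           # contact point is an assumption
session = cluster.connect("my_keyspace")   # hypothetical keyspace

# Raise the replication factor so every node holds a copy (3-node cluster assumed):
session.execute("""
    ALTER KEYSPACE my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

# Query at the strictest consistency level: all replicas must respond,
# so a single unavailable replica fails the request (availability is sacrificed).
stmt = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ALL,
)
row = session.execute(stmt, ("42",)).one()

# QUORUM is the more common compromise: a majority of replicas must respond.
quorum_stmt = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(quorum_stmt, ("42", "Alice"))
```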
However, I feel compelled to mention what a terrible, terrible idea this all is. Cassandra was engineered to be "AP." Fighting that intrinsic design is a fool's errand. I've always said, nobody wins when you try to out-Cassandra Cassandra.
If you're employing engineering time to make a datastore function in ways that are contrary to its design, then a different datastore (one you don't have to work against) might be the better choice.

EventProcessorClient partition ownership management

How does partition lease ownership management work in the EventProcessorClient? There is an article with a high-level description, but I am looking for more details, like lease management, timings, expiration, etc.
Note: question is about EventProcessorClient, not EventProcessorHost.
The EventProcessorClient uses a conservative approach to claiming partitions intended to allow processors to cycle up without having partitions "bounce" between them as they fight for ownership until things have stabilized.
The load balancing cycle ticks every 10 seconds, at which point the processor will determine whether or not to claim an additional partition. Partition ownership uses a 30 second lease, which is used primarily to determine when a partition may have been orphaned; each cycle, the lease for claimed partitions is extended when the owner is healthy. The PartitionLoadBalancer implementation is fairly heavily commented, if you're interested in further details.
The values for the load balancing configuration can be seen by observing the defaults for the EventProcessorOptions class, which influences the EventProcessor<TPartition> class that serves as a base for the EventProcessorClient.
At present, the load balancing configuration is not exposed within the EventProcessorClientOptions in order to simplify the API surface, though that is a topic of discussion for some changes that are being worked on in the near future.
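For intuition only, here is a simplified Python sketch of a greedy, lease-based load-balancing cycle of the kind described above (10-second cycle, 30-second lease). It is not the SDK's code; ownership_store is a stand-in for the checkpoint store that persists ownership records, with hypothetical list() and claim() methods.

```python
import random
import time
from dataclasses import dataclass

LOAD_BALANCING_INTERVAL = 10   # seconds between load-balancing cycles
LEASE_DURATION = 30            # seconds before an unrenewed ownership is considered orphaned

@dataclass
class Ownership:
    partition_id: str
    owner_id: str
    last_modified: float        # epoch seconds of the last renewal

def run_load_balancer(my_id, partition_ids, ownership_store):
    """One processor instance's load-balancing loop (illustrative only)."""
    while True:
        now = time.time()
        ownerships = ownership_store.list()        # one record per claimed partition
        active = [o for o in ownerships if now - o.last_modified < LEASE_DURATION]

        owners = {o.owner_id for o in active} | {my_id}
        fair_share = len(partition_ids) // len(owners)   # even split across healthy processors

        mine = [o for o in active if o.owner_id == my_id]
        unowned = [p for p in partition_ids if p not in {o.partition_id for o in active}]

        # Renew leases on partitions we already own so they don't look orphaned.
        for o in mine:
            ownership_store.claim(Ownership(o.partition_id, my_id, now))

        # Claim at most ONE additional partition per cycle; this conservative pace
        # keeps partitions from bouncing between processors while they start up.
        if len(mine) < fair_share + 1 and unowned:
            ownership_store.claim(Ownership(random.choice(unowned), my_id, now))

        time.sleep(LOAD_BALANCING_INTERVAL)
```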

How is Cassandra not consistent but HBase is?

While going through reading materials on Cassandra and HBase, I found that Cassandra is not consistent but HBase is. I didn't find any proper reading material explaining this.
Could anybody provide any blogs/articles on this topic?
Cassandra is consistent, eventually. Based on Brewer's theorem (also known as the CAP theorem), distributed data systems can only guarantee to achieve 2 of the following 3 characteristics:
Consistency.
Availability.
Partition tolerance.
What this means is that Cassandra, in its default configuration, can guarantee to be available and partition tolerant, and there may be a delay before achieving consistency. But this is configurable: you can increase the consistency level for any query, sacrificing availability.
There are multiple resources on the web; look up "eventual consistency in Cassandra". You can start with Ed Capriolo's talk, or this post on Quora.
Actually, since version 1.1 HBase has two consistency models:
Consistency.STRONG is the default consistency model provided by HBase. In case the table has region replication = 1, or in a table with region replicas but the reads are done with this consistency, the read is always performed by the primary regions, so that there will not be any change from the previous behaviour, and the client always observes the latest data.
In case a read is performed with Consistency.TIMELINE, then the read RPC will be sent to the primary region server first. After a short interval (hbase.client.primaryCallTimeout.get, 10ms by default), parallel RPC for secondary region replicas will also be sent if the primary does not respond back...
In other words, strong consistency is achieved by allowing reads only against the replica that does the writing, while the timeline-consistent behaviour (the Reference Guide makes a point of differentiating timeline vs. eventual consistency) provides highly available, low-latency reads at the expense of a small chance of reading stale data.
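The TIMELINE read pattern described in that quote, primary first and then speculative reads against secondaries after a short timeout, can be sketched like this in Python; primary and secondaries stand for hypothetical replica clients exposing a get(key) call, and the timeout mirrors the 10ms default mentioned above.

```python
import concurrent.futures as futures

PRIMARY_CALL_TIMEOUT = 0.010   # mirrors hbase.client.primaryCallTimeout.get (10 ms)

def timeline_read(key, primary, secondaries):
    """Ask the primary replica first; if it hasn't answered within the short
    timeout, fan out to the secondary replicas and take whichever answers first.
    The result may be slightly stale if a secondary wins the race."""
    pool = futures.ThreadPoolExecutor(max_workers=1 + len(secondaries))
    try:
        primary_future = pool.submit(primary.get, key)
        try:
            return primary_future.result(timeout=PRIMARY_CALL_TIMEOUT)
        except futures.TimeoutError:
            pass  # primary is slow: launch speculative reads against the secondaries
        secondary_futures = [pool.submit(replica.get, key) for replica in secondaries]
        done, _ = futures.wait(
            [primary_future, *secondary_futures],
            return_when=futures.FIRST_COMPLETED,
        )
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)
```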

Which part of the CAP theorem does Cassandra sacrifice and why?

There is a great talk here about simulating partition issues in Cassandra with Kingsbury's Jepsen library.
My question is: with Cassandra, are you mainly concerned with the partition tolerance part of the CAP theorem, or is consistency a factor you need to manage as well?
Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency. However, real world systems rarely fall neatly into these categories, so it's more helpful to view CAP as a continuum. Most systems will make some effort to be consistent, available, and partition tolerant, and many (including Cassandra) can be tuned depending on what's most important. Turning knobs like replication factor and consistency level can have a dramatic impact on C, A, and P.
Even defining what the terms mean can be challenging, as various use cases have different requirements for each. So rather than classify a system as CP, AP, or whatever, it's more helpful to think in terms of the options it provides for tuning these properties as appropriate for the use case.
Here's an interesting discussion on how things have changed in the years since the CAP theorem was first introduced.
CAP stands for Consistency, Availability and Partition Tolerance.
In general, it's impossible for a distributed system to guarantee all three of the above at a given point in time.
Apache Cassandra falls under the AP category, meaning Cassandra favours Availability and Partition Tolerance over Consistency, but this can be further tuned via the replication factor (how many copies of the data) and the consistency level (for reads and writes).
For more info: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html
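A handy rule of thumb when tuning: reads are guaranteed to overlap with the latest acknowledged write when R + W > RF, where R and W are the read and write consistency levels expressed as replica counts. A small sketch of that arithmetic:

```python
from math import floor

def quorum(rf: int) -> int:
    """Number of replicas in a majority for a given replication factor."""
    return floor(rf / 2) + 1

def overlapping(read_cl: int, write_cl: int, rf: int) -> bool:
    """Reads are guaranteed to see the latest acknowledged write when the
    read and write replica sets must overlap: R + W > RF."""
    return read_cl + write_cl > rf

rf = 3
print(quorum(rf))                               # 2
print(overlapping(quorum(rf), quorum(rf), rf))  # True  -> QUORUM reads + QUORUM writes
print(overlapping(1, 1, rf))                    # False -> ONE/ONE can read stale data
```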
Interestingly, it depends on your Cassandra configuration. Cassandra can at most be an AP system. But if you configure it to read or write based on quorum, then it does not remain CAP-available (available as per the definition of the CAP theorem) and is effectively only a P system.
Just to explain things in more detail, the CAP theorem means:
C: (Linearizability or strong consistency) roughly means:
If operation B started after operation A successfully completed, then operation B must see the system in the same state as it was on completion of operation A, or a newer state (but never an older state).
A:
“every request received by a non-failing [database] node in the system must result in a [non-error] response”. It’s not sufficient for some node to be able to handle the request: any non-failing node needs to be able to handle it. Many so-called “highly available” (i.e. low downtime) systems actually do not meet this definition of availability.
P:
Partition Tolerance (terribly misnamed) basically means that you’re communicating over an asynchronous network that may delay or drop messages. The internet and all our data centres have this property, so you don’t really have any choice in this matter.
Source: Martin Kleppmann's excellent work.
The CAP theorem states that a database can't simultaneously guarantee consistency, availability, and partition tolerance.
Since network partitions are part of life, distributed databases tend to be either CP or AP.
Cassandra was designed to be AP, but you can fine-tune consistency at the cost of availability.
Availability: this is ensured with replicas. Cassandra typically writes multiple copies to different cluster nodes (generally 3). If one node is unavailable, data won't be lost.
Writing data to multiple nodes takes time because the nodes are scattered across different locations. At some point, the data becomes eventually consistent.
So with a preference for high availability, consistency is compromised.
Tunable consistency:
For a read or write operation, you can specify a consistency level. The consistency level refers to the number of replicas that need to respond for a read or write operation to be considered complete.
For non-critical features, you can use a lower consistency level, say ONE.
If you think consistency is important, you can increase the level to TWO, THREE, or QUORUM (a majority of replicas).
Assume that you set the consistency level high (QUORUM) for your critical features and a majority of the replica nodes are down. In this case, the write operation will fail.
Here Cassandra sacrifices availability for consistency.
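A tiny sketch of that trade-off, assuming a 3-node cluster where every node holds a replica (RF=3):

```python
def write_succeeds(rf: int, consistency_level: int, nodes_up: int) -> bool:
    """A write at a given consistency level succeeds only if at least that
    many of the RF replicas are reachable (3-node, RF=3 cluster assumed)."""
    reachable_replicas = min(rf, nodes_up)
    return reachable_replicas >= consistency_level

QUORUM = 3 // 2 + 1   # 2 of 3 replicas

# All replicas up: a QUORUM write succeeds.
print(write_succeeds(rf=3, consistency_level=QUORUM, nodes_up=3))   # True
# Two of the three replicas down: the QUORUM write fails -> availability sacrificed.
print(write_succeeds(rf=3, consistency_level=QUORUM, nodes_up=1))   # False
# The same outage at consistency level ONE still succeeds -> consistency sacrificed.
print(write_succeeds(rf=3, consistency_level=1, nodes_up=1))        # True
```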
Have a look at this article for more details.

Resources