YugabyteDB failover time

Can someone answer?
What is the failover time if a node goes down? I have seen it take 2 to 3 seconds, but do we have control to make it faster than this? If yes, by how much? We cannot take this hit on the application queue; it is that sensitive.

In YugabyteDB, tables are sharded into tablets, and tablets are replicated among nodes using Raft (a distributed consensus protocol). Raft is also used to elect, for each tablet, one of the tablet's peers as the leader.
In a typical situation, a node will have many tablets: some in the follower role, and some in the leader role. When a node fails, the tablets for which this node was the leader can see a small window of unavailability until new leaders are elected for them. (Note: YugabyteDB is a CP database.) This leader (re)election is triggered when the followers of a tablet have not heard from their leader for a certain number of heartbeats. The knobs that govern this, and hence determine the failover time, are the following gflags:
raft_heartbeat_interval_ms (default: 500 ms)
leader_failure_max_missed_heartbeat_periods (default: 6)
In other words, by default, if a follower doesn't hear from the leader for 6 heartbeat periods, i.e. 6 × 500 ms ≈ 3 seconds, new leaders get elected.
It is possible to override these settings to reduce the failover time. However, take care not to make them too aggressive, as that can cause leadership to ping-pong unnecessarily on small network hiccups.
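For illustration, halving both knobs would cut the detection window to roughly one second. The flag names are the real gflags listed above, but the values here are only a sketch, not a recommendation; test against your network's actual latency before going this aggressive (other required server flags omitted):

yb-tserver \
  --raft_heartbeat_interval_ms=250 \
  --leader_failure_max_missed_heartbeat_periods=4

With these values, a leader failure is detected after 4 × 250 ms ≈ 1 second, after which a new election starts.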

Related

Does a Hyperledger Fabric network need an odd number of orderers if Raft is selected as the ordering service?

If OrdererType: etcdraft is selected, is it mandatory to have an odd number of orderers? I referred to the following Hyperledger Fabric documentation but did not find anything specific on that:
https://hyperledger-fabric.readthedocs.io/en/release-2.2/raft_configuration.html
Consenters:
  - Host: raft0.example.com
    Port: 7050
    ClientTLSCert: path/to/ClientTLSCert0
    ServerTLSCert: path/to/ServerTLSCert0
  - Host: raft1.example.com
    Port: 7050
    ClientTLSCert: path/to/ClientTLSCert1
    ServerTLSCert: path/to/ServerTLSCert1
The Hyperledger Fabric documentation has many pages about Raft, but it does not share a single worked example comparing a setup with Raft and one without, and never uses the words "distributed consensus". I have seen hundreds of technical articles demonstrating how to run X orderers, but when you work on real problems the real question comes up: how many orderers do you actually need? (Sure, it all depends on the problem you are working on.)
I found the following reference on this topic:
Reference: https://lists.hyperledger.org/g/fabric/topic/71083347
While it's possible to use the console to build a configuration of any number of ordering nodes (no configuration is explicitly restricted), some numbers provide a better balance between cost and performance than others. The reason for this lies in satisfying the needs of high availability (HA) and in understanding the Raft concept of the "quorum", the minimum number of nodes that must be available (out of the total number) for the ordering service to process transactions.
In Raft, a majority of the total number of nodes is needed to form a quorum. In other words, if you have one node, you need that node available to have a quorum, because the majority of one is one. Similarly, if you have two nodes, you will need both available, since the majority of two is two (for this reason, a configuration of two nodes is discouraged; there is no advantage to a two node configuration). In a similar vein, the majority of three is two, the majority of four is three, the majority of five is three, and so on.
While satisfying the quorum will make sure the ordering service is functioning, production networks also have to think about deployment configurations that are highly available (in other words, configurations in which the loss of a certain number of nodes can be tolerated by the system). Typically, this means tolerating two nodes failing: one node going down during a normal maintenance cycle, and another going down for any other reason (such as a power outage or error). This is why five nodes is a good number for a production network. Recall that the majority of five is three. This means that in a five node configuration, the loss of two nodes can be tolerated. If your configuration features four nodes, only one node can be down for any reason before another node going down means a quorum has been lost and the ordering service will stop processing transactions.
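The arithmetic in that quote boils down to quorum = floor(n/2) + 1 and tolerated failures = n - quorum. A quick sketch in Python, just to tabulate the numbers (nothing Fabric-specific):

# Majority quorum and fault tolerance for n Raft ordering nodes
for n in range(1, 8):
    quorum = n // 2 + 1  # majority of n
    print(f"{n} orderers: quorum={quorum}, tolerates {n - quorum} failure(s)")

This shows why two orderers are discouraged (they tolerate zero failures, no better than one) and why five is the usual production choice (it tolerates two, matching the maintenance-plus-accident scenario above).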
I also ran a few experiments, e.g. with 1 orderer and with 2 orderers, and both worked.
But all that told me is that the system runs with 1 or 2 orderers; it still wasn't clear why we really need 5 to start with.
Possibly start with 1 orderer and watch the transaction complexity: if multiple client applications reach the blockchain network via multiple peers and submit complex transaction proposals, it may require 1+X orderers, but we have to see which transactions are problematic. We may not require distributed consensus all the time, so not many orderers.
As per the latest (Fabric 2.x) documentation, it is recommended to deploy at least three, and optimally five, nodes in an ordering service.
Wherever you choose to deploy your components, you will need to make sure you have enough resources for the components to run effectively. The sizes you need will largely depend on your use case. If you plan to join a single peer to several high-volume channels, it will need much more CPU and memory than if you only plan to join a single channel. As a rough estimate, plan to dedicate approximately three times the resources to a peer as you plan to allocate to a single ordering node (as you will see below, it is recommended to deploy at least three and optimally five nodes in an ordering service). Similarly, you should need approximately a tenth of the resources for a CA as you will for a peer.
You will also need to add storage to your cluster (some cloud providers may provide storage), as you cannot configure Persistent Volumes and Persistent Volume Claims without storage being set up with your cloud provider first. The use of persistent storage ensures that data such as MSPs, ledgers, and installed chaincodes are not stored on the container filesystem, preventing them from being destroyed if the containers are destroyed.
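To make those ratios concrete (a hedged illustration; only the 3:1 peer-to-orderer and 10:1 peer-to-CA ratios come from the quote above, while the absolute sizes are assumed): if you give each ordering node 0.5 vCPU and 1 GB of RAM, plan roughly 1.5 vCPU and 3 GB per peer, and roughly 0.15 vCPU and 0.3 GB per CA. A minimal three-orderer, two-peer, one-CA setup would then need about 4.7 vCPU and 9.3 GB in total, before storage.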

Transactions per second vs Time for one transaction in Hyperledger

The Hyperledger Fabric paper states that it supports 3500 transactions per second.
But what is the time it takes to commit/execute one transaction?
Is it equal to one block time?
Can we use Hyperledger for a real-time application that needs a transaction to be done within 100 ms?
3500 transactions per second means that, on average, 3500 transactions were created by some clients and then appeared in some newly created block within one second.
Several blocks may have been created within that second.
You can't use Fabric, or any other consensus mechanism that runs over IP networks, for a hard real-time application, simply because network transmission can be delayed by more than 100 ms.
You can, however, configure the ordering service in Fabric to use a low batch timeout (say, 10 ms); blocks would then be formed frequently and you would get a transaction latency of less than 100 ms on average.
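A minimal sketch of what that looks like in the Orderer section of configtx.yaml (these are the standard batching keys; the 10 ms value mirrors the suggestion above, the block-size numbers are just sample defaults, and both should be tested before adopting):

Orderer: &OrdererDefaults
  OrdererType: etcdraft
  BatchTimeout: 10ms
  BatchSize:
    MaxMessageCount: 10
    AbsoluteMaxBytes: 99 MB
    PreferredMaxBytes: 512 KB

A block is cut when either the timeout expires or a batch-size limit is hit, so a small BatchTimeout bounds latency at the cost of producing many small blocks.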

Is It Possible To Run Nodes In Bursts?

NOTE: This question is framed in the context of a private network where the business network operator owns and manages all the nodes on the network as a service and only provides access via a REST API or a web GUI.
Assuming that the application is mostly batch based and not real time, is it possible to run nodes in bursts so that they start once an hour, process any transactions and then shut down again when the processing is complete?
Or maybe have a trigger that starts up the node automatically when it is needed.
Azure has per-second billing, which has the potential to drastically reduce infrastructure costs.
Generally speaking, this wouldn't be possible. You can think of nodes as being like email servers: you never know when an email (i.e. a transaction, for a node) comes in, so they have to be online all the time.
However, if you control all nodes on the network, you could build a queuing system outside of Corda, spin up all the nodes on the network once an hour, and then drain your queue by sending the transactions then.
This would likely become tricky once there are entities on the network that you don't control, though. Alternatively, you could run the nodes on the smallest possible Azure instances and keep the cost down that way.
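A minimal sketch of that hourly burst approach, assuming the Azure CLI and one VM per Corda node (the resource group, VM names, and the queue-drain step are placeholders for your own tooling):

#!/bin/bash
# Run from cron once an hour, e.g.:  0 * * * * /opt/corda/burst.sh
set -e
RG=corda-network-rg            # hypothetical resource group
NODES="partya partyb notary"   # hypothetical VM names

for n in $NODES; do az vm start --resource-group "$RG" --name "$n"; done

# ... drain the external transaction queue against the nodes' RPC endpoints ...

for n in $NODES; do az vm deallocate --resource-group "$RG" --name "$n"; done
# 'deallocate' (rather than 'stop') releases the compute so billing stops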

How do I determine the number of node types, the number of nodes, and the VM size in a Service Fabric cluster for a relatively simple but high-throughput API?

I have an ASP.NET Core 2.0 Web API with relatively simple logic (a simple select on an Azure SQL DB returning about 1000-2000 records; no joins, aggregates, functions, etc.). I have only one GET API, which is called from an Angular SPA. Both are deployed in Service Fabric as stateless services, hosted in Kestrel as self-hosted executables.
Considering the number of users and how often they refresh, I've determined there will be around 15,000 requests per minute, in other words 250 req/sec.
I'm trying to understand the different settings when creating my Service Fabric cluster.
I want to know:
How many Node Types? (I've determined as Front-End, and Back-End)
How many nodes per node type?
What is the VM size I need to select?
I have read the Azure documentation on cluster capacity planning. While I understand the concepts, I don't have a frame of reference to determine the actual values I need to provide for the questions above.
In most places where you read about cluster planning, they will suggest that this subject is part science and part art, because there is no easy answer. It's hard to answer because it depends a lot on the complexity of your application; without knowing the internals of how it works, we can only guess at a solution.
Based on your questions, the best guidance I can give you is: measure first, measure again, measure... plan later. Your application might be memory intensive, network intensive, CPU intensive, disk intensive, and so on; the only way to find the best configuration is to understand it.
To understand your application before you make any decision on the SF structure, you could simply deploy a cluster with multiple node types containing one node of each VM size and measure your application's behavior on each of them; then add more nodes, span multiple instances of your service across these nodes, and see which configuration is the best fit for each service.
1. How many node types?
I like to map node types 1:1 to roles in your application, but this is not a law; it depends on how much resource each service consumes. If a service consumes enough resources to keep a single VM (node) busy (memory, CPU, disk, IO), it is a good candidate for its own node type. In other cases there are lightweight services for which provisioning an entire VM (node) would be a waste of resources, for example scheduled jobs, backups, and so on. For these you could provision a set of machines shared among such services. One important thing to keep in mind when you share a node type among multiple services is that they will compete for resources (memory, CPU, network, disk), so the performance measurements you took for each service in isolation might no longer hold and the services may require more resources; the only option is to test them together.
Another point is the number of replicas. Having a single instance of your service is not reliable, so you have to create replicas of it (the right number is discussed in the next answer). You then end up with the service's load split across multiple nodes, potentially leaving the node type underutilized; that is where you would consider placing several services on the same node type.
2. How many nodes per node type?
As stated before, it will depend on your services' resource consumption, but a very basic rule is a minimum of 3 per node type.
Why 3?
Because 3 is the lowest number that lets you perform a rolling update while keeping a majority (51% quorum) of nodes/services/instances running.
1 node: If a service runs 1 instance on a node type of 1 node, then when you deploy a new version of the service you have to bring this instance down before the new one comes up, so you have no instance to serve the load while upgrading.
2 nodes: Similar to 1 node, but during the upgrade you keep only 1 node running; in case of a failure, there is no failover to handle the load until the new instance comes up. It is worse if you are running a stateful service, because you have only one copy of your data during the upgrade, and in case of failure you might lose data.
3 nodes: During an update you still have 2 nodes available; when the one being updated comes back, the next one is taken down and you still have 2 nodes running. If one node fails, the remaining nodes can carry the load until a new node is deployed.
3 nodes does not mean your cluster will be highly reliable; it means the chances of failure and data loss are lower. You might be unlucky and lose 2 nodes at the same time. As the docs suggest, in production it is better to keep the number of nodes at 5 or more, and to plan for a 51% quorum of nodes/services to stay available. For cases where you really need higher uptime (99.9999...%), I would recommend 5, 7, or 9 nodes.
3. What is the VM size I need to select?
As said before, only measurements will give this answer.
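That said, as a back-of-envelope illustration for the 250 req/sec in the question (a hedged sketch; every number below is an assumption to show the method, not a measured value):

# Illustrative capacity check for a stateless front-end node type
target_rps = 250      # from the question: ~15,000 requests/minute
nodes = 3             # the minimum per node type argued above
headroom = 0.5        # plan to run nodes at ~50% utilization at peak

# If one node is down (failure or rolling upgrade), the rest absorb the load
per_node_rps = target_rps / (nodes - 1)        # 125 req/s per node
vm_capacity_needed = per_node_rps / headroom   # size the VM for ~250 req/s

print(f"Each node must sustain {per_node_rps:.0f} req/s; "
      f"pick a VM size you have measured at ~{vm_capacity_needed:.0f} req/s")

The measurement step is what turns "~250 req/s per VM" into an actual VM SKU.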
Observations:
These recommendations do not take into account planning for the primary node type. It is recommended to have at least 5 nodes in the primary node type; that is where the SF system services are placed. They are responsible for managing the cluster, so they must be highly reliable, otherwise you risk losing control of your cluster. If you plan to share these nodes with your application services, keep in mind that your services might impact the system services, so you should always monitor for any impact they might cause.

Distributed Lock Manager with Azure SQL database

We have a Web API using an Azure SQL database. The database model has Customers and Managers. Customers can add appointments, and we can't allow overlapping appointments from two or more Customers for the same Manager. Because we are working in a distributed environment (multiple instances of the web server can insert records into the database at the same time), there is a possibility that invalid appointments will be saved. As an example, Customer 1 wants an appointment between 10:00 and 10:30, and Customer 2 wants an appointment between 10:15 and 10:45. If both requests arrive at the same time, the validation code in the Web API will not catch the error. That's why we need something like a distributed lock manager. We have read about Redlock (from Redis) and ZooKeeper. My question is: is Redlock or ZooKeeper a good choice for our use case, or is there some better solution?
If we use Redlock, we would go with Azure Redis Cache because we already use Azure to host our Web API. We plan to identify the shared resource (the resource we want to lock) by ManagerId + Date. This would lock a Manager for one date, so other locks for the same Manager on other dates would still be possible. We plan to use one instance of Azure Redis Cache; is this safe enough?
Q1: Is Redlock or ZooKeeper a good choice for our use case, or is there some better solution?
I consider Redlock not the best choice for your use case because:
a) its guarantees hold only for a specific amount of time (the TTL) set before performing the DB operation. If for some reason the DB operation takes longer than the TTL (ask your DevOps for surprising examples, and also check How to do distributed locking), you lose the guarantee of lock validity (see lock validity time in the official documentation). You could use a large TTL (minutes), or you could try to extend the lock's validity from another thread that monitors the DB operation time, but this gets incredibly complicated. With ZooKeeper (ZK), on the other hand, your lock is there until you remove it or your process dies. Your DB operation could hang, which would leave the lock hanging too, but this kind of problem is easily spotted by DevOps tooling, which will kill the hanging process, which in turn frees the ZK lock (there is also the option of a monitoring process that does this faster and in a fashion more specific to your business).
b) while trying to lock, the processes must "fight" to win the lock; "fighting" means waiting and then retrying to acquire it. This can make the retry count overflow, leading to a failure to acquire the lock. This seems to me a less important issue, but with ZK the solution is far better: there is no "fight"; all processes simply line up and wait their turn for the lock (check the ZK lock recipe).
c) Redlock is based on time measurements, which is incredibly tricky; check at least the paragraph containing "feeling smug" in How to do distributed locking (and the Conclusion paragraph too), then think again about how large that TTL value should be for you to be sure about your Redlock (time-based) locking.
For these reasons I consider Redlock a risky solution, while ZooKeeper is a good solution for your use case. I don't know of another distributed-locking solution better suited to your case, but other distributed locking solutions do exist; e.g., check Apache ZooKeeper vs. etcd3.
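To make the ZK option concrete, here is a minimal sketch using the kazoo Python client's lock recipe with the question's ManagerId + Date key scheme (the ensemble address, paths, and the save step are placeholder assumptions):

from kazoo.client import KazooClient

def save_appointment_if_no_overlap():
    # Placeholder: re-run the overlap validation and INSERT inside the lock
    pass

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

# One lock node per manager per date, as proposed in the question
lock = zk.Lock("/appointment-locks/manager-42/2023-09-14", identifier="web-instance-1")

with lock:  # blocks until this process reaches the front of the queue of waiters
    # The lock is held until this block exits or the process dies
    save_appointment_if_no_overlap()

zk.stop()

Note how this matches point b) above: waiters queue up in order rather than fighting with retries.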
Q2: We plan to use one instance of Azure Redis Cache, is this safe enough?
It could be safe for your use case because the TTL seems to be predictable (if we really trust the time measuring; see the warning below), but only if the slave taking over a failed master can be delayed (I'm not sure this is possible; you should check the Redis configuration capabilities). If you lose the master before a lock is synchronized to the slave, another process could simply acquire the same lock. Redlock recommends using delayed restarts (check "Performance, crash-recovery and fsync" in the official documentation) with a period of at least 1 TTL. If, for the Q1 a) and c) reasons, your TTL is very long, your system won't be able to lock for what may be an unacceptably long period (because the only Redis master you have must be replaced by the slave in a delayed fashion).
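For the single-instance case, the pattern described in the Redis docs is SET with NX and PX for acquire, plus an atomic check-and-delete for release. A minimal sketch in Python with redis-py (the endpoint, key, and TTL are assumptions mirroring the question's ManagerId + Date scheme):

import uuid
import redis

r = redis.Redis(host="myapp.redis.cache.windows.net", port=6380, ssl=True)  # hypothetical Azure endpoint

key = "lock:manager-42:2023-09-14"   # ManagerId + Date, as in the question
token = str(uuid.uuid4())            # unique per acquirer, so we only release our own lock

# Atomic acquire: only succeeds if the key does not exist; expires after 30 s
acquired = r.set(key, token, nx=True, px=30_000)

if acquired:
    try:
        pass  # validate overlap + insert appointment, well within the TTL
    finally:
        # Atomic check-and-delete so we never release someone else's lock
        release = """
        if redis.call('get', KEYS[1]) == ARGV[1] then
            return redis.call('del', KEYS[1])
        else
            return 0
        end
        """
        r.eval(release, 1, key, token)

All the TTL caveats from Q1 still apply: if the DB work outlives the 30 s expiry, the lock is silently lost.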
PS: I stress again: read Martin Kleppmann's opinion on Redlock, where you'll find surprising reasons for a DB operation to be delayed (search for "before reaching the storage service") and also surprising reasons not to rely on time measuring when locking (plus an interesting argument against using Redlock).
