I have my network running with 3 machines, each one with:
1 orderer
1 peer
1 ca
1 Node.js client
They are deployed on AWS and a load balancer correctly distributes the requests to the 3 different clients.
The clients are not using the discovery service; it is disabled.
Client1 only contacts orderer1, peer1, and ca1, and so on for the others.
I want to test Hyperledger's high availability, so while I am inserting data I shut down a machine, let's say machine1, and the others should continue executing.
What happens is that while the machine is down, the network stops executing. The clients do not make any progress (they do not crash, they just stop).
When I bring the machine up again, I see some errors, but execution continues from that point.
It seems like the calls to machine1 are suspended and they recover as soon as the machine is back up.
What I want is that if machine1 goes down, the requests to it are rejected and machines 2 and 3 continue executing.
How can I achieve this?
[EDIT] Additional information: I have added some logging to the client, in particular in my endpoint for creating transactions, like this:
console.log('Starting Creation')
// submit the transaction and wait for it to be committed
await contract.submitTransaction(example)
console.log('Creation done')
res.sendStatus(200)
Let me also say that these lines are wrapped in an error handler, so that if any error occurs, I catch and report it.
But I get no error; I just see the first log line, and submitTransaction keeps running for a very long time, never receiving an answer.
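To keep the endpoint from hanging indefinitely, one thing I could do is race the submit against a client-side timeout, so the request fails fast instead of waiting forever. This is only a sketch: TIMEOUT_MS is a placeholder value I chose, and example is the same placeholder argument as above.

console.log('Starting Creation')
// Placeholder timeout: fail the request instead of waiting indefinitely for a dead node.
const TIMEOUT_MS = 30000
const timeout = new Promise((resolve, reject) =>
  setTimeout(() => reject(new Error('submitTransaction timed out')), TIMEOUT_MS)
)
try {
  await Promise.race([contract.submitTransaction(example), timeout])
  console.log('Creation done')
  res.sendStatus(200)
} catch (err) {
  console.error('Creation failed or timed out', err)
  res.sendStatus(503)
}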
It seems like it tries to deliver the request to the orderer, but the orderer is not online.
When I bring down an orderer with docker service scale orderer1=0 (I am using services with Docker Swarm), the orderer leader notices in its logs that the node went offline. Also, if I bring the orderer up again, a new election starts.
This seems correct; in fact, the problem only happens when I shut down the machine, closing the connection in a non-graceful way.
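A non-graceful shutdown means the TCP connections to that machine can stay half-open, so the client's gRPC calls never fail on their own. One option I am looking at is enabling gRPC keepalives through the grpcOptions section of the connection profile, so dead endpoints are detected after a bounded time. This is a sketch only: the peer name, URL and values are placeholders, and the exact set of supported grpcOptions depends on the Fabric SDK version.

// Connection-profile fragment (shown as a JS object); names, URL and values are placeholders.
const connectionProfile = {
  peers: {
    'peer1.example.com': {
      url: 'grpcs://peer1.example.com:7051',
      grpcOptions: {
        'grpc.keepalive_time_ms': 20000,          // ping the endpoint every 20 seconds
        'grpc.keepalive_timeout_ms': 5000,        // drop the connection if no ack within 5 seconds
        'grpc.keepalive_permit_without_calls': 1  // also ping while no call is in flight
      }
    }
  }
}

With keepalives in place, a call to a machine that disappeared should fail with an error instead of hanging, and the error handler around submitTransaction can then reject or redirect the request.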
Related
I'm currently researching and experimenting with Kubernetes in Azure. I'm playing with AKS and the Application Gateway ingress. As I understand it, when a pod is added to a service, the endpoints are updated and the ingress controller continuously polls this information. As new endpoints are added, AG is updated; as they're removed, AG is also updated.
As pods are added, there will be a small delay whilst that pod is added to the AG before it receives requests. However, when pods are removed, does that delay in the update result in requests being forwarded to a pod that no longer exists?
If not, how does AG/K8S guarantee this? What behaviour could the end client potentially experience in this scenario?
Azure Application Gateway Ingress is an ingress controller for your Kubernetes deployment which allows you to use the native Azure Application Gateway to expose your application to the internet. Its purpose is to route traffic directly to pods. At the same time, everything concerning pod availability, scheduling and, generally speaking, management remains with Kubernetes itself.
When a pod receives a command to terminate, it doesn't happen instantly. Right after that, kube-proxies update iptables to stop directing traffic to the pod. There may also be ingress controllers or load balancers forwarding connections directly to the pod (which is the case with an Application Gateway). It's impossible to solve this issue completely, but adding a 5-10 second delay can significantly improve the user experience.
If you need to terminate or scale down your application, you should consider the following steps (a minimal Node.js sketch follows the list):
Wait for a few seconds and then stop accepting connections
Close all keep-alive connections that are not in the middle of a request
Wait for all active requests to finish
Shut down the application completely
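In a Node.js service, those steps could look roughly like this. This is a minimal sketch under the assumption of an Express-style HTTP server and a /ready endpoint wired to the readiness probe; the port and SHUTDOWN_DELAY_MS are placeholders.

// Minimal graceful-shutdown sketch for a Node.js HTTP server (assumed setup).
const express = require('express')
const app = express()
let ready = true

// Readiness endpoint: the kubelet calls this to decide whether the pod
// should stay in the Service endpoints.
app.get('/ready', (req, res) => res.sendStatus(ready ? 200 : 503))

const server = app.listen(8080)

process.on('SIGTERM', () => {
  ready = false                       // start failing the readiness probe
  const SHUTDOWN_DELAY_MS = 5000      // give endpoints/AG time to stop sending traffic
  setTimeout(() => {
    // stop accepting new connections; in-flight requests are allowed to finish
    server.close(() => process.exit(0))
  }, SHUTDOWN_DELAY_MS)
})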
Here are the exact Kubernetes mechanisms that will help you resolve your questions:
preStop hook - this hook is called immediately before a container is terminated. It is very helpful for graceful shutdowns of an application. For example, a simple "sleep 5" command in a preStop hook can prevent users from seeing "connection refused" errors. After the pod receives an API request to be terminated, it takes some time to update iptables and let the Application Gateway know that this pod is out of service. Since the preStop hook is executed prior to the SIGTERM signal, it helps to resolve this issue.
(example can be found in attach lifecycle event)
readiness probe - this type of probe always runs on the container and defines whether the pod is ready to accept and serve requests. When a container's readiness probe returns success, it means the container can handle requests and it will be added to the endpoints. If a readiness probe fails, the pod is not capable of handling requests and is removed from the Endpoints object. This works very well for newly created pods when an application takes some time to load, as well as for already running pods whose application takes some time to process requests.
Before a pod is removed from the endpoints, the readiness probe has to fail several times. It's possible to lower this number to a single failure using the failureThreshold field; however, the probe still needs to detect at least one failed check.
(additional information on how to set it up can be found in configure liveness readiness startup probes)
startup probe - for applications that require additional time for their first initialisation, it can be tricky to set readiness probe parameters correctly without compromising the fast response the application needs afterwards.
Using the failureThreshold * periodSeconds fields of a startup probe provides this flexibility.
terminationGracePeriodSeconds - may also be considered if an application requires more than the default 30-second delay to shut down gracefully (this is important for stateful applications, for example).
I was trying to migrate my Hyperledger Fabric network (running a RAFT ordering service) from one host to another.
In this process, I was making sure that TLS communication is respected, which means that I made the required changes in the system channel before the migration. I used the backup and the genesis block (of the old ordering service) to restore the network on the target host. One new thing I found was that when the orderer nodes started on the new host, it took 10 minutes for them to sync blocks and start the RAFT election.
The question is: Is this default time configured in the orderer code-base or is it some other functionality?
NOTE: I know that when an existing orderer node is added to an application channel, it takes 5 minutes by default for that orderer to detect the change. So, is the above situation similar to this, or is it a different capability?
The complete logs of the orderer node (the one that was started first on the new host) can be found here.
Eviction suspicion is a mechanism which triggers after a default timeout of 10 minutes.
We have a Node.js application that connects via pg-promise to a Postgres 11 server - all processes are running on a single cloud server in Docker containers.
Sometimes we hit a situation where the application does not react anymore.
The last time this happened, I had a little time to check the db via pgAdmin, and it showed that the connections were idle in transaction with statement BEGIN and an exclusive lock of type virtualxid.
I think the situation is like this:
1. the application has started a transaction by sending the BEGIN sql command to the db
2. the db got this command and started a new transaction and thus acquired an exclusive lock of mode virtualxid
3. now the db waits for the application to send the next statement(s) (until it receives COMMIT or ROLLBACK) - and then it will release the exclusive lock of mode virtualxid
4. but for some reason it does not get any more statements:
I think that the Node.js event loop is blocked, because at the time when we see these locks, the Node.js application does not log any more statements. But the web server still gets requests and reported some upstream timed-out requests.
Does this make sense (I'm really not sure about steps 2 and 3)?
Why would all transactions block at the beginning? Is this just coincidence or is the displayed SQL maybe wrong?
BTW: In this answer I found that we can set idle_in_transaction_session_timeout so that these transactions will be released after a timeout - which is great, but I am trying to understand what is causing this issue.
The transactions are not blocking at all. The database is waiting for the application to send the next statement.
The lock on the transaction ID is just a technique for transactions to block each other, even if they are not contending for a table lock (for example, if they are waiting for a row lock): each transaction holds an exclusive lock on its own transaction ID, and if it has to wait for a concurrent transaction to complete, it can just request a lock on that transaction's ID (and be blocked).
If all transactions look like this, then the lock must be somewhere in your application; the database is not involved.
When looking for processes blocked in the database, look for rows in pg_locks where granted is false.
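For example, with pg-promise such a check could look like the sketch below, assuming db is the initialized pg-promise database object:

// Sketch: list locks that have not been granted, i.e. backends actually blocked in the database.
const blocked = await db.any(
  'SELECT pid, locktype, mode, relation::regclass AS relation ' +
  'FROM pg_locks WHERE NOT granted'
)
console.log(blocked)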
Your interpretation is correct. As for why it is happening, that is hard to say. It seems like there is some kind of bug (maybe an undetected deadlock) in your application, or maybe in Node.js or pg-promise. You will have to debug at that level.
As expected, the problems were caused by our application code. Transactions were used incorrectly:
One of the REST endpoints started a new transaction right away, using Database.tx().
This transaction was passed down multiple levels, but one function in the chain had an error and passed undefined instead of the transaction to the next level.
The lowest repository-level function then started a new transaction (because the transaction parameter was undefined) by calling Database.tx() a second time.
This started to fail under heavy load:
The connection pool size was set to 10
When there were many simultaneous requests for this endpoint, we had a situation where 10 of the requests started (opened the outer transaction) but had not yet reached the repository code that would request the 2nd transaction.
When these requests reached the repository code, they requested a new (2nd) connection from the connection pool. But this call blocks, because all connections are currently in use.
So we had a nasty application-level deadlock.
So the solution was to fix the application code (the intermediate function must pass down the transaction correctly). Then everything works.
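A minimal sketch of the corrected pattern is shown below. The function and table names are made up for illustration; db is the pg-promise database object:

// Corrected pattern: the same transaction context `t` is passed all the way down,
// so the whole chain uses a single pooled connection and a single transaction.
async function createOrder(db, order) {
  return db.tx(async t => {              // outer transaction
    await insertOrder(t, order)          // pass t down - never undefined
    await insertAuditRecord(t, order)
  })
}

async function insertOrder(t, order) {
  // repository level: reuse the received context instead of calling db.tx() again
  return t.none('INSERT INTO orders(id, total) VALUES($1, $2)', [order.id, order.total])
}

async function insertAuditRecord(t, order) {
  return t.none('INSERT INTO audit(order_id) VALUES($1)', [order.id])
}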
Moreover, I strongly recommend setting a sensible idle_in_transaction_session_timeout and connection timeout. Then, even if such an application deadlock is introduced again in future versions, the application can recover automatically after the timeout.
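As a rough sketch of what that could look like (the values, host and database name are placeholders; idle_in_transaction_session_timeout can also be set in postgresql.conf or per role instead):

// Client side: placeholder pool/timeout values passed through pg-promise to node-postgres.
const pgp = require('pg-promise')()
const db = pgp({
  host: 'localhost',                // placeholder
  database: 'mydb',                 // placeholder
  max: 10,                          // pool size
  connectionTimeoutMillis: 5000,    // don't wait forever when establishing a connection
  query_timeout: 30000              // client-side per-query timeout
})

// Server side (run once, e.g. via psql):
//   ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '60s';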
Notes:
pg-promise before v10.3.4 contained a small bug #682 related to the connection timeout
pg-promise before version 10.3.5 could not recover from an idle-in-transaction timeout and left the connection in a broken state: see pg-promise #680
Basically there was another issue: there was no need to use a transaction at all, because all the functions were just reading data, so we can simply use Database.task() instead of Database.tx().
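For illustration, a read-only chain could share a task instead (a sketch; the query and names are placeholders):

// A task reuses one pooled connection for the whole chain but opens no transaction.
const orders = await db.task(t =>
  t.any('SELECT id, total FROM orders WHERE customer_id = $1', [customerId])
)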
I'm testing the master/slave functionality of Amazon MQ, and I would like to trigger a failover under load and ensure that all messages sent are received by my subscribers.
I can't see any options in the GUI or CLI to trigger a failover scenario. I have tried rebooting the broker, but this affects both nodes. Predictably, this caused my test to fail, as the standby broker was not available when my clients tried to reconnect.
There is some comfort in that my clients did try to reconnect to the standby broker immediately, but I still can't be sure that all messages would get through if a failover occurred.
Is this possible at all?
I have 2 different machines in the cloud.
Containers on first machine:
orderer.mydomain.com
peer0.org1.mydomain.com
db.peer0.org1.mydomain.com
ca.org1.mydomain.com
Containers on second machine:
peer0.org2.mydomain.com
db.peer0.org2.mydomain.com
ca.org2.mydomain.com
I start them both. I can make them both join the same channel. I deploy a BNA exported from Hyperledger Composer to both peers. I send transactions to peer0.org1.mydomain.com, query peer0.org2.mydomain.com, and get the same results.
Everything works perfectly so far.
However, after 5-10 minutes the peer on the second machine (peer0.org2) gets disconnected from the orderer. When I send transactions to org1, I can query them from org1 and see the results. But org2 gets detached and no longer accepts new transactions (the orderer connection is gone). I can query org2 and see the old results.
I added CORE_CHAINCODE_KEEPALIVE=30 to my peer's environment variables. I see keepalive actions in the org2 peer's logs, but it didn't solve my problem.
I should note: the containers are in a Docker network called "basic". This network was used on my local computer; however, it still works in the cloud.
In the orderer logs:
Error sending to stream: rpc error: code = Internal desc = transport is closing
This happens every time I try. But when I run these containers on my local machine, they stay connected without problems.
EDIT1: After checking the logs: peer0.org2 receives all transactions and sends them to the orderer. The orderer receives requests from the peer but can't update the peers. I can connect to both the requestUrl and the eventUrl on the problematic peer. There is no network problem.
I guess I found the problem. It is about MS Azure networking. After 4 minutes, Azure cuts idle connections:
https://discuss.pivotal.io/hc/en-us/articles/115005583008-Azure-Networking-Connection-idle-for-more-than-4-minutes
EDIT1:
Yes, the problem was about MS Azure. If anyone out there is trying to run Hyperledger on Azure, keep in mind that if a peer stays idle for more than 4 minutes, Azure times out the TCP connection. You can configure it to time out after 30 minutes. It is not a bug, but it was annoying for us not to be able to understand why it wasn't working after 4 minutes.
So you can use your own server or another cloud solution, or use Azure by adapting to its rules.