I have a bare-metal Kubernetes cluster running on separate LXC containers as nodes on Ubuntu 20.04. It has an Istio service mesh configured and approximately 20 application services running on it (Istio ServiceEntries are created for the external services that need to be reached). I use MetalLB for the gateway's external IP provisioning.
I have an issue with pods making requests outside the cluster (egress), specifically to external web services such as the Cloudflare API or Sendgrid API for REST calls. DNS is working fine, as the hosts I try to reach are indeed resolvable and reachable from the pods (containers). The problem is that only a pod's first request to the internet succeeds; after that, its REST API calls fail with random read ECONNRESET errors, and sometimes connect ETIMEDOUT, though less frequently than the first error. Making network requests from the nodes themselves to the internet shows no problems at all, and pods communicate with each other through the cluster's Services without any of these issues.
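For reference, the failing calls are plain HTTPS requests. This is a minimal repro sketch I can run inside an affected pod (the URL and interval are just placeholders for whichever external API you want to test):

const https = require('https');

// repro.js - run from inside an affected pod. The URL is a placeholder for
// whichever external API you call; any HTTP status (even 401) means the
// TCP/TLS connection itself succeeded, which is what is being tested here.
function probe() {
  const req = https.get('https://api.sendgrid.com/v3/scopes', { timeout: 10000 }, (res) => {
    console.log(new Date().toISOString(), 'status', res.statusCode);
    res.resume(); // drain the body so the socket is released
  });
  req.on('timeout', () => req.destroy(new Error('connect ETIMEDOUT (manual timeout)')));
  req.on('error', (err) => console.error(new Date().toISOString(), 'error', err.code || err.message));
}

probe();
setInterval(probe, 5000); // subsequent calls are the ones that fail with ECONNRESET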
My guess is that something is misconfigured and the return packets are not being delivered back to the pod, but I can't find any relevant help on the internet and I'm a bit lost on this one. I'm very grateful for any help and will happily provide more details if needed.
Thank you all once again!
Related
I have an Azure App Gateway connected to 3 different App Service apps, all running as part of the same App Service Plan (3 different back-end pools). In the Backend Health section of the AG, one of the apps/pools is constantly flipping between Healthy and Unknown states. I have checked the entire network configuration according to this article and everything seems to be configured properly.
I have configured IP restrictions on the App Services according to this article, specifying the subnet the AG resides in as allowed. I have also temporarily allowed my own IP address, and every time the health for the one app goes to "Unknown", I am still able to access the App Service using its native .azurewebsites.net URL locally from my machine.
Any ideas how I can troubleshoot this?
Please check whether the points below help to work around the issue.
As an initial workaround, try restarting the application gateway after the backend is deployed.
Also check this discussion on the GitHub issue:
Sometimes App Gateway will cache the response indefinitely, and the fix may be "Dynamic DNS", which ensures that a "non-existent domain" response is not cached on the App Gateway. Also check for the fix using v16.
Also check this similar issue, which says to use custom domain names, as the request looks for some domain.
I would like to know how I can protect my Node.js microservices so that only the API gateway can access them. Currently the microservices are exposed on unique ports on my machine and can be accessed directly without passing through the gateway. That defeats the purpose of the gateway, which should serve as the only entry point in the system for secure and authorized information exchange.
The microservices and the gateway are currently built with Node.js and Express.
The plan is to eventually deploy it on the cloud (DigitalOcean). I'd appreciate any response. Thanks.
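To illustrate, each service currently looks roughly like this (port and route are just examples); since it listens on all interfaces, anyone who knows the port can bypass the gateway entirely:

// users-service.js - current setup (simplified); the port is an example.
const express = require('express');
const app = express();

app.get('/users', (req, res) => res.json([{ id: 1, name: 'demo' }]));

// Listening on all interfaces, so the service is reachable directly,
// not just through the API gateway.
app.listen(4001, () => console.log('users service on port 4001'));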
Kubernetes can solve this problem.
Kubernetes manages containers, where each container can be a microservice.
While connecting your microservices to your gateway server, you can choose to allow foreign connections only to the gateway server. You would have a load balancer / nginx in your Kubernetes cluster that redirects requests to your gateway server.
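As a rough sketch of that layout (service names and ports are hypothetical, and each microservice is assumed to be exposed only as a ClusterIP Service, so it is unreachable from outside the cluster), the gateway simply proxies to the in-cluster DNS names:

// gateway.js - the only Deployment exposed via a LoadBalancer/NodePort Service.
// The microservices behind it are plain ClusterIP Services, so the DNS names
// below resolve only from inside the cluster.
const express = require('express');
const { createProxyMiddleware } = require('http-proxy-middleware');

const app = express();

// Example auth check before anything is forwarded.
app.use((req, res, next) => {
  if (!req.headers.authorization) return res.status(401).send('Unauthorized');
  next();
});

app.use('/users', createProxyMiddleware({ target: 'http://users-service:3000', changeOrigin: true }));
app.use('/orders', createProxyMiddleware({ target: 'http://orders-service:3000', changeOrigin: true }));

app.listen(8080, () => console.log('gateway listening on 8080'));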
Kubernetes has many other features such as:
Service discovery: each of your microservices' IPs could potentially change on restart or deployment unless you have static IPs for all your services. Service discovery solves this problem.
High availability, horizontal scaling, and zero downtime: you can configure several replicas for each of your services, so when one of them goes down there are still other replicas alive to handle the remaining requests. This also helps with CI/CD. With something like GitHub Actions, you can build a smooth CI/CD pipeline: when you deploy a new Docker image (update a microservice), Kubernetes launches the new container first and then kills the old one, so you have zero downtime.
If you are working with microservices, you should definitely take a deep dive into Kubernetes.
I'm taking my first foray into Azure Service Fabric using a cluster hosted in Azure. I've successfully deployed my cluster via ARM template, which includes the cluster manager resource, VMs for hosting Service Fabric, a Load Balancer, an IP address, and several storage accounts. I've successfully configured the certificate for the management interface, and I've written and deployed an application to my cluster. However, when I try to connect to my API via Postman (or even via a browser, e.g. Chrome) the connection invariably times out and I never get a response. I've double-checked all of my settings for the Load Balancer, and traffic should be getting through, since my load-balancing rules use the same front-end and back-end port as my API in Service Fabric. Can anyone provide me with some tips on how to troubleshoot this situation and find out where exactly the connection problem lies?
To clarify, I've examined the documentation here, here and here
Have you tried logging in to one of your service fabric nodes via remote desktop and calling your API directly from the VM? I have found that if I can confirm it's working directly on a node, the issue likely lies within the LB or potentially an NSG.
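If it helps, a quick way to run that comparison from the node (assuming Node.js happens to be installed there; the hostname, port, and path below are placeholders for your actual API) is something like:

// check.js - run on a Service Fabric node (over RDP) to compare hitting the
// service locally vs. going back out through the public Load Balancer.
// Use the https module instead if your API listens over TLS.
const http = require('http');

function check(label, host) {
  const req = http.get({ host, port: 8080, path: '/api/values', timeout: 5000 }, (res) => {
    console.log(label, 'responded with status', res.statusCode);
    res.resume();
  });
  req.on('timeout', () => req.destroy(new Error('timed out')));
  req.on('error', (err) => console.log(label, 'failed:', err.message));
}

check('local ', 'localhost');                           // directly on the node
check('via LB', 'mycluster.westus.cloudapp.azure.com'); // through the public IP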
See the related question: Kubernetes pod exec API exception: Response must not include 'Sec-WebSocket-Protocol' header if not present in request.
I have been able to successfully make a WebSocket connection using the Pod exec API, but I am using kubectl proxy on localhost to handle authorization on behalf of the terminal client.
The next step is to authorize the request directly against the Kubernetes API server, so that there's no need to route the traffic via kubectl proxy. Here's a discussion in the Python community where they were able to send an authorization token to the api-server, but I haven't had any success with this in Node.js. I must admit I am not familiar enough with Python to fully understand the discussion.
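For reference, the connection that does work through kubectl proxy looks roughly like this in Node.js (pod name, namespace, and command are placeholders, and kubectl proxy is assumed to be on its default port 8001):

// exec-via-proxy.js - works because kubectl proxy injects the credentials.
const WebSocket = require('ws');

const params = 'command=sh&stdin=true&stdout=true&stderr=true&tty=true';
const url = `ws://127.0.0.1:8001/api/v1/namespaces/default/pods/my-pod/exec?${params}`;

// The exec endpoint requires one of the channel subprotocols.
const ws = new WebSocket(url, ['channel.k8s.io']);

ws.on('open', () => {
  // Channel 0 is stdin: send "ls\n" prefixed with the channel byte.
  ws.send(Buffer.concat([Buffer.from([0]), Buffer.from('ls\n')]));
});
ws.on('message', (data) => {
  // First byte is the channel (1 = stdout, 2 = stderr), the rest is output.
  process.stdout.write(data.slice(1).toString());
});
ws.on('close', (code) => console.log('closed with', code));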
Can someone from the kubernetes team point me in the right direction?
Thanks
For future wanderers....
Although the exec API supports the Authorization header, the browser WebSocket API doesn't support setting it yet. So the solution for us was to reverse-proxy the connection from our server APIs.
It went like this...
client browser -wss-> GKE LB (SSL Termination) -ws-> site API (nodejs) -WSS & Authorization-> kube api-server exec API
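A minimal sketch of that middle hop, using the node-http-proxy package (the api-server address, token source, and port are placeholders, and your real API will have its own auth in front of this):

// ws-bridge.js - the "site API" hop: accepts the browser WebSocket and
// re-opens it against the kube api-server with an Authorization header.
const http = require('http');
const httpProxy = require('http-proxy');

const proxy = httpProxy.createProxyServer({
  target: 'https://KUBE_APISERVER:443',   // placeholder api-server address
  changeOrigin: true,
  secure: false,                          // or supply the cluster CA instead
  ws: true,
  headers: { Authorization: `Bearer ${process.env.KUBE_TOKEN}` },
});

const server = http.createServer((req, res) => proxy.web(req, res));

// WebSocket upgrades (the exec API) are proxied here; http-proxy forwards
// the Sec-WebSocket-Protocol header from the browser unchanged.
server.on('upgrade', (req, socket, head) => proxy.ws(req, socket, head));

server.listen(8080, () => console.log('bridge listening on 8080'));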
So to answer my own question: per my tests, GKE Kubernetes supports Authorization only in headers, so you need to reverse proxy if you want to connect to it from a browser. Per this code, some Kubernetes setups allow tokens in the query string, but I didn't have any success with that on GKE. If you are using a different cluster host, YMMV. I welcome comments from the Kubernetes team on my observations.
If you came here only for an authorization issue, you may stop reading further.
There are still more challenges to overcome though, and there's good news and bad news... the good news first:
GKE Loadbalancer automatically handles SSL termination even for WebSockets, so you can proxy to either WS or WSS without any issues.
And then the bad news:
GKE Loadbalancer force-terminates ALL connections within 30 seconds, even if they are in use! There are workarounds, but they either don't stay put, require you to deploy your own controller, or require you to use Ingress. What this means for a terminal session is that Chrome will close the client with a 1006 code, even if a command is running at that time.
For some WS scenarios, it may be acceptable to simply reconnect on a 1006 close, but for a terminal session, this is a deal-breaker as you cannot reconnect to the previous terminal instance and must begin with a new one.
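For the scenarios where reconnecting is acceptable, the client-side handling can be as small as this sketch (the URL is a placeholder):

// Browser-side sketch: transparently reopen the socket when the GKE LB
// kills it with 1006. Not suitable for terminal sessions, as noted above.
function connect(url, onMessage) {
  const ws = new WebSocket(url);
  ws.onmessage = onMessage;
  ws.onclose = (event) => {
    if (event.code === 1006) {
      setTimeout(() => connect(url, onMessage), 1000); // simple backoff
    }
  };
  return ws;
}

connect('wss://example.com/stream', (event) => console.log(event.data));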
For now we have resorted to increasing the timeout of the GKE Loadbalancer. But eventually we are planning to deploy our own Loadbalancer which can handle this better. Ingress has some issues which we don't want to live with at the moment.
EDIT: 3/16/17
Today we tried creating a new VM in the same private network and resource group (using the same network security group as well) and ran the test; it worked perfectly. It makes no sense.
Using PsPing to diagnose this (testing ports 443 and 8001), we can see the dropped traffic there as well, so it's not the application. The same tests on the other test Azure VM work flawlessly. So it seems to be this particular VM; I just don't understand why, since nothing has changed. We also see dropped traffic to other random sites over 443 (ping and HTTP are flawless).
We've tried rebooting and redeploying with no luck.
Original:
Our Azure VM is experiencing failures in creating connections to several 3rd-party systems, occurring in a seemingly random fashion (every 5-10 minutes or so an HTTP call will fail).
Since Monday morning, we have been seeing web service (HTTP) calls made from the Azure VM fail in this seemingly random fashion. We are getting error messages suggesting the endpoint host is simply not responding. We have engaged both 3rd parties, and it appears that these HTTP calls are not reaching their servers at all. Everything was working fine up until Monday, and no changes have been made to the system.
We think the Azure VM (or Azure networking limits) is causing the problem because:
We created and deployed the same “test” program on both the Azure VM and an on-prem test VM with the same specs, and the program works fine on our on-premises VM.
a. This program simply makes information request calls, a single one to each 3rd-party system, and is run every minute; a sketch of this test program is shown after this list. Thus, both servers (the Azure VM and the on-prem VM) repeat identical calls on the same schedule.
b. On the test server, the success rate is 100%; we have seen no errors, even when I bumped it up to try every 10 seconds.
c. On the production server, we see frequent errors connecting to both systems.
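For reference, the test program is essentially equivalent to the following sketch (shown here in Node.js; the real program may be written in a different language, and the 3rd-party URLs are placeholders):

// test-calls.js - scheduled once per minute on both the Azure VM and the
// on-prem VM. The endpoint URLs below are placeholders for the two
// 3rd-party APIs; the real program makes one information request to each.
const https = require('https');

const endpoints = [
  'https://thirdparty-one.example.com/api/info',
  'https://thirdparty-two.example.com/api/info',
];

for (const url of endpoints) {
  const req = https.get(url, { timeout: 15000 }, (res) => {
    console.log(new Date().toISOString(), url, 'status', res.statusCode);
    res.resume(); // drain the response so the socket is released
  });
  req.on('timeout', () => req.destroy(new Error('timed out')));
  req.on('error', (err) => console.error(new Date().toISOString(), url, 'FAILED:', err.code || err.message));
}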
Looking at the IIS logs from the 3rd parties, we see blank spots where the failed HTTP calls should appear. In any event, no suspicious activity shows up on their end, and their logs only show the successful calls.
We show only 10-20 TCP connections on the server, so we are not close to hitting the Azure 500k TCP connection limit.
Pings and tests to any site on the internet from the Azure VM work fine, so basic network connectivity appears to be fine.
Are we hitting some other kind of limit on the Azure system, that could be causing these random errors?
I noticed this person had a similar issue, but no resolution was found (Azure VM outbound HTTP is unreliable).