Frequent errors when deploying Helm to GKE with Terraform

Components
GKE
Helm v3
Terraform
Note: The errors below are raised, but if I keep running terraform apply/delete multiple times, they somehow auto-resolve. I am using the Google Cloud Console, so there is no chance of my own internet connection messing things up.
Error Type 1:
Error: Error reading ComputeNetwork "projects/foo/global/networks/bar-network": Get https://www.googleapis.com/compute/v1/projects/foo/global/networks/bar-network-e4l6-network?alt=json: dial tcp [1111:2222:4003:c03::5f]:443: connect: cannot assign requested address
Error Type 2:
Error reading Service Account "projects/foo/serviceAccounts/bar-sa#foo.iam.gserviceaccount.com": Get https://iam.googleapis.com/v1/projects/foo/serviceAccounts/example-cluster-sa#dravoka2.iam.gserviceaccount.com?alt=json&prettyPrint=false: dial tcp [1111:2222:4003:c04::5f]:443: connect: cannot assign requested address
Error Type 3:
Error: Error retrieving available container cluster versions: Get https://container.googleapis.com/v1beta1/projects/foo/locations/us-central1-c/serverConfig?alt=json&prettyPrint=false: dial tcp [1111:2222:4003:c03::5f]:443: connect: cannot assign requested address
Error Type 4:
Error reading instance group manager returned as an instance group URL: "googleapi: Error 404: The resource 'projects/foo/zones/us-central1-c/instanceGroupManagers/gke-bar-main-pool-8c2b8edd-grp' was not found, notFound"
I don't understand why it pops up randomly, yet when I re-run the same terraform apply/delete it magically works fine!
Any guidance would help.

I had this exact same problem, and after several hours of looking at it in more detail, I think I know what is happening and how to work around it. Since implementing the following workaround, I've had a 100% success rate in apply/destroy operations.
Problem:
For some reason, Terraform is preferring the AAAA (IPv6) record over the A record: you can see in the error responses that each *.googleapis.com hostname resolved to an IPv6 address. Since the Cloud Shell environment doesn't have IPv6 enabled, the connection fails with "cannot assign requested address". Based on searches for similar errors, this appears to be a problem with Go's resolver rather than with Terraform itself.
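You can check what the resolver returns from inside Cloud Shell; this is just a diagnostic sketch (any of the failing hostnames will do):

import socket

# Print every address record returned for one of the failing endpoints.
# If AF_INET6 entries are present (and preferred) in an environment with
# no IPv6 route, connect() fails with "cannot assign requested address".
for family, _, _, _, sockaddr in socket.getaddrinfo("www.googleapis.com", 443):
    print("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])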
Solution:
Short of changing Terraform's source code, you can instead add entries to your /etc/hosts file so that each API hostname Terraform calls resolves to an IPv4 address. As Cloud Shell is hosted on Google Cloud, you can use the private.googleapis.com range (199.36.153.8/30). To automate this, put the following in the .customize_environment file in your home directory (it runs automatically when your Cloud Shell instance starts):
export APIS="googleapis.com www.googleapis.com storage.googleapis.com iam.googleapis.com container.googleapis.com cloudresourcemanager.googleapis.com"
for i in $APIS
do
  echo "199.36.153.10 $i" >> /etc/hosts
done
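After the Cloud Shell instance restarts, you can verify the override took effect with getent hosts www.googleapis.com (it should now print 199.36.153.10), after which terraform apply and destroy should connect over IPv4.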
For reference, I created an issue in the Google provider to track it.

Related

I am trying to run terraform init but getting this error: Failed to query available provider packages

terraform init is giving the following error. No versions have been upgraded, and it was working a few days ago, but suddenly it is failing.
Error: Failed to query available provider packages
Could not retrieve the list of available versions for provider hashicorp/aws:
could not connect to registry.terraform.io: Failed to request discovery
document: Get "https://registry.terraform.io/.well-known/terraform.json": read: connection reset by peer
When I run curl from the server, it is not able to connect either.
curl https://registry.terraform.io/
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to registry.terraform.io:443
Are you on a network where an admin might have installed a proxy between you and the internet? If so, you need to get the signing certificates and configure them in your provider.
If you're on a home network or a public one, this is a man in the middle attack. Do not use this network.
If you have the certificates, they can be configured in your aws provider by pointing cacert_path, cert_path and key_path at the appropriate .pem files.
If you have verified that there is a valid reason to have a proxy between you and the internet, you are not touching production, and the certificates are hard to come by, you can test your code by setting insecure = true on your provider. Obviously, don't check that in.
I get this error from time to time. It's been frequently reported on the Terraform GitHub page. One particular comment always reminds me to refresh my network settings (e.g. restart the network connection):
OK, I think I have isolated and resolved the issue in my case. It's always DNS to blame in the end, right? I hardcoded Cloudflare DNS (1.1.1.1 and its IPv4 and IPv6 aliases) into my network settings on the laptop, and since then everything seems to be working like a treat.
Here is how I fixed a similar New Relic provider download issue:
Error while installing newrelic/newrelic v3.13.0: could not query provider registry for registry.terraform.io/newrelic/newrelic: failed to retrieve authentication checksums for provider: the request failed
│ after 2 attempts, please try again later: Get "https://github.com/newrelic/terraform-provider-newrelic/releases/download/v3.13.0/terraform-provider-newrelic_3.13.0_SHA256SUMS": net/http: request canceled
│ while waiting for connection (Client.Timeout exceeded while awaiting headers)
https://learnubuntu.com/change-dns-server/
Add Google's public nameservers to /etc/resolv.conf, for example:
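nameserver 8.8.8.8
nameserver 8.8.4.4
(8.8.8.8 and 8.8.4.4 are Google's public resolvers; any resolver you trust works here.)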
Then check which server is being used with this command:
dig google.com | grep SERVER
and done. Note that this is a temporary change: it will be reset when you start a new terminal session.

Python sockets and detecting no internet connection

I am trying to separate out the different types of errors that can occur when using Python's socket module to perform internet requests, so that I can handle them appropriately. In particular, I want to handle the case where no internet connection is available, so that I can check and wait for connectivity to return.
The requests I am dealing with are URL-based, and at their core is socket's getaddrinfo() function, which resolves a hostname to an IP address. For remote hosts this involves a DNS query, which obviously cannot succeed without internet access, and hence is the first point of failure when no connection is available.
However, it seems that getaddrinfo() raises the same exception for both "no internet" (i.e. DNS query send failed or timed out) and a non-existent domain (i.e. internet is available and DNS query/answer completed) - and as a result I cannot detect the no internet condition that I need.
For example, resolving the following non-existent domain while an internet connection is available:
socket.getaddrinfo("www.thsidomaindoesntexist12309657.com", 80)
The resulting exception is:
socket.gaierror: [Errno 11001] getaddrinfo failed
Where errno 11001 corresponds to socket.EAI_NONAME. This is the expected behaviour, since the domain actually does not exist.
Yet when I try the following existing domain with no internet connection (network adaptor disabled):
socket.getaddrinfo("www.google.com", 80)
I get exactly the same exception as before, apparently indicating that the domain does not exist (even though we can't know because we never got a DNS response).
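For completeness, catching the exception and inspecting its errno shows the ambiguity directly (a sketch; note that some platforms raise EAI_AGAIN rather than EAI_NONAME when the resolver itself is unreachable):

import socket

try:
    socket.getaddrinfo("www.google.com", 80)
except socket.gaierror as e:
    # Fires with EAI_NONAME both for a truly unknown name and, as
    # described above, when no DNS server could be reached at all.
    print(e.errno == socket.EAI_NONAME)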
How can I detect no internet connection when using sockets and getaddrinfo()? Is it impossible, or is this a bug I am experiencing, or something else?
How can I detect no internet connection when using sockets and getaddrinfo()
You cannot get the information from getaddrinfo. This function is an abstract interface to name resolution, which could be done against a local DNS server but is also often done against a DNS server in the local network. The function provides no information about why it failed to resolve a name, since it often has no idea itself why the resolution failed.
Apart from that, even if the DNS lookup succeeds there is no guarantee that there is actually a working internet connection. The result of the DNS lookup might be served from cache, or the name might resolve while the actual connection to the target (like a web server) is blocked by some firewall. It might also be that one gets an IP address but the wrong one, as is often the case with captive portals.
The common way to find out if there is a working internet connection is to contact a known good server with a specific request and check if the response is as expected.
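Following that approach, a minimal Python probe might look like this (the gstatic URL is just one well-known connectivity-check endpoint; substitute any server whose response you can predict):

import socket
import urllib.error
import urllib.request

def internet_available(timeout=5):
    # Ask a known-good endpoint for its expected response; Google's
    # connectivity-check URL answers with an empty HTTP 204.
    try:
        with urllib.request.urlopen(
                "http://connectivitycheck.gstatic.com/generate_204",
                timeout=timeout) as resp:
            return resp.status == 204
    except (urllib.error.URLError, socket.timeout):
        return False

Combined with the behaviour above, a gaierror plus a failed probe suggests the connection is down, while a gaierror with a successful probe points at a genuinely non-existent name.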

Unable to Add Azure DB Firewall Rule to Allow Build Server to Run Tests

We use a Visual Studio Online-hosted build server to automate our build process. As part of this I'm looking into adding unit and integration tests into this process.
These tests require access to our SQL Azure DBs (2 of them, both on the same server), which in turn requires access through the DB server's firewall.
I have a PowerShell script which uses New-AzureRmSqlServerFirewallRule to add IP addresses to the DB server, and these firewall rules are successfully showing up in the Azure portal.
Specifically, the script adds firewall rules for:
All IPv4 addresses* on the build server (as returned by Get-NetIPAddress)
Build server's external IP address (as returned by https://api.ipify.org)
In addition, it appears that the pre-defined AllowAllAzureIPs and AllowAllWindowsAzureIps rules are added automatically.
However, the tests subsequently fail with the exception:
System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
I'm unsure why the build server is unable to reach the DB server - could it be that the host of the test processes is using yet a different IP address?
Update
As has been pointed out, the exception message mentions "Named Pipes Provider", which suggests that the DB connection is using a named pipe instead of a TCP/IP connection. To test this, I changed the local app.config to contain an unknown/random/inaccessible IP and ran the tests locally (they otherwise run successfully locally): I received exactly the same exception message mentioning "Named Pipes Provider". Perhaps at some level the ReliableSqlConnection class falls back to a named pipe, but my point is that I can induce this very same exception by switching to an unknown or inaccessible IP address in my DB connection string.
Furthermore, the DB connection string starts with tcp:, which, as per this blog post, explicitly tells the connection to use TCP/IP and not named pipes.
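For reference, a connection string of that shape typically looks like the following (server, database, and credentials here are placeholders):

Server=tcp:yourserver.database.windows.net,1433;Initial Catalog=yourdb;User ID=youruser;Password=...;Encrypt=True;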
I have also modified the firewall rule to permit all IP addresses (0.0.0.0 to 255.255.255.255) but the same exception is still thrown. This suggests that the SQL Azure firewall rule is not the cause of the 'blockage'.
My suspicion therefore turns to network access being blocked (though a whitelist is probably present to permit the build server to reach the code repository). I added a very simple PowerShell script to the start of the build process:
Test-Connection "172.217.18.100" #resolves to www.google.com
This results in
Testing connection to computer '172.217.18.100' failed: Error due to lack of resources
Have the build servers disabled ping/ICMP or is all outgoing traffic blocked?
* The script only considers IPv4 addresses because I haven't had any success in passing IPv6 addresses to New-AzureRmSqlServerFirewallRule.
We finally solved the issue. The problem had nothing to do with firewalls. The app.config files in our unit tests didn't go through the transformation step that our web.config files did, so all the settings came from our local development environment and were therefore wrong.
More about this here:
Connect to external services inside Visual Studio Online build/test task
What connection string are you using? Your error seems to indicate that this is not truly a firewall issue, but rather a connection is being attempted to a server that doesn't exist.
My (incorrect, as it turned out) hypothesis right now is that your connection string contains only the server name, without the .database.windows.net suffix, which causes the client driver to look for the server on the local network. The error presented appears not to be a firewall-related issue.
(Edited to reflect the author's feedback.)
If you're connecting over TCP, then why is your error message saying Named Pipes?
[...]
(provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
I'd look into this paradox first.
The firewall test is very simple: allow 0.0.0.0 to 255.255.255.255 (i.e. 0.0.0.0/0) and re-test. My money is on the same error message.

Error connecting to EC2 instance with third-party API

I am using a third-party API which accepts JSON input and responds with JSON output. Locally I checked the API response on port 8181 and it works great. When I deploy and test the same setup in the production environment on AWS, it fails with the error:
Could not get any response
There seems to be an error connecting to https://ec2 instance public ip:8181/auth/raw
I am able to ping the public IP of the server. I have tried to find a solution but without success. Please suggest how I can resolve this.
After much head-scratching I solved it myself: I added a Custom TCP inbound rule for port 8181 to the instance's security group.
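For reference, the same inbound rule can be added from the AWS CLI with something like this (the security group ID is a placeholder, and 0.0.0.0/0 should be narrowed to the clients that actually need access):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8181 --cidr 0.0.0.0/0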

Cloudant bulk insert errors

I'm getting a lot of intermittent errors from Cloudant when I post several thousand ~1000-character documents, 10 at a time, to _bulk_docs from a Node app running on my local machine:
Error: getaddrinfo ENOTFOUND samdutton.cloudant.com
What does this error mean?
I've found a few similar problems online, but does anyone have suggestions for how to avoid this error?
"getaddrinfo" represents your machine's inability to use DNS to find an IP address for the domain name "samdutton.cloudant.com". Can you confirm that your machine is able to resolve this DNS record correctly by doing
dig samdutton.cloudant.com
or
nslookup samdutton.cloudant.com
from your command line?