Azure DNS response time

We have an application hosted on a compute instance in Azure. DNS queries seem to be quite slow. Can we check why the responses are slow and whether there is any caching at the OS level?

Slow responses from Azure DNS may be due to the following:
When you create new DNS zones and DNS records, they appear on the Azure DNS name servers quickly, within a few seconds.
When you modify existing DNS records, it may take a little longer.
It can take up to 60 seconds for changes to be reflected on the Azure DNS name servers.
As you mentioned, 'DNS caching by DNS clients and DNS recursive resolvers outside of Azure DNS also can affect timing.'
Use the Time-To-Live (TTL) property to manage the cache duration for the record set.
Time-To-Live (TTL) represents how long each record is cached by clients before being re-queried.
TTL value ranges between 1 and 2,147,483,647 seconds.
For more detail, please refer to the links below:
https://learn.microsoft.com/en-us/azure/dns/dns-faq#how-long-does-it-take-for-dns-changes-to-take-effect-
https://learn.microsoft.com/en-us/azure/dns/dns-zones-records#time-to-live
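
To see where the time goes, one option is to time the lookups yourself: once through the OS-configured resolver (which reflects any local or recursive caching) and once directly against one of the zone's authoritative Azure DNS name servers. Below is a minimal sketch assuming the dnspython package, with a placeholder record name and name server; swap in your own zone's values.

```python
# pip install dnspython
# Minimal sketch, assuming dnspython, a hypothetical record "app.contoso.com"
# and a placeholder Azure DNS name server for the zone.
import time
import dns.resolver

RECORD = "app.contoso.com"          # placeholder: your record
AZURE_NS = "ns1-01.azure-dns.com."  # placeholder: an authoritative NS of your zone

def timed_query(resolver: dns.resolver.Resolver, name: str) -> float:
    start = time.perf_counter()
    answer = resolver.resolve(name, "A")
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{name}: {answer[0]} ttl={answer.rrset.ttl} in {elapsed:.1f} ms")
    return elapsed

# 1) Query via the OS-configured resolver twice: if the second call is much
#    faster, a cache between you and the authoritative servers is working.
local = dns.resolver.Resolver()  # uses /etc/resolv.conf or Windows settings
timed_query(local, RECORD)
timed_query(local, RECORD)

# 2) Query the authoritative Azure DNS name server directly to measure the
#    uncached, authoritative response time (no recursive caching involved).
authoritative = dns.resolver.Resolver(configure=False)
authoritative.nameservers = [str(dns.resolver.resolve(AZURE_NS, "A")[0])]
timed_query(authoritative, RECORD)
```

As for caching at the OS level: on many Linux images systemd-resolved keeps a local cache (resolvectl statistics shows hit counts), and on Windows the DNS Client service keeps one (ipconfig /displaydns shows it), so part of the answer depends on which stub resolver your VM image uses.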

Related

Azure ACI not running but still serving up an Azure Functions page

I deployed a new ACI using the option 'mcr.microsoft.com/azuredocs/aci-helloworld:latest (linux)' from the Azure portal. Once that's deployed and running, visiting the FQDN for the container loads the page below. Makes sense.
However, if I stop the ACI instance and wait a few minutes I get the following page for about the next 15 minutes. Except mine says functions 3.0. After those 15 minutes, I then get a DNS probe error message which makes sense. If my ACI is stopped why is there a function app responding to requests?
I can only speculate, but this may still be valuable information for you.
The 15-minute gap
The 15-minute gap sounds very much like DNS caching. When I deploy a container instance in the West Europe region with the hostname "my-important-container" and a public IP, I get a publicly available DNS record for it like this:
my-important-container.westeurope.azurecontainer.io
In this case, the DNS record is created for you by the Azure platform. Microsoft engineers have probably set 15 minutes of caching as the default value.
When creating a DNS record by hand, you can specify the number of seconds for which it will be cached across the global network of DNS servers, so that they don't have to resolve it against the authoritative server every single time someone uses that name to reach a web service. A 15-minute cache means the authoritative server has to answer only 1 query instead of 1,000 if there are 1,000 requests to a website within a 15-minute window (from the same area, using the same non-authoritative resolver).
If you want to experiment with DNS caching, it is very easy using Azure: for example with Azure DNS zones, or, if you don't want to buy a domain, with Azure Private DNS zones on a private VNet, where you can see how caching works.
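
One easy way to see that caching in action, assuming the dnspython package is available and using a public recursive resolver (8.8.8.8 here) purely as an example: query the container's FQDN repeatedly and watch the TTL returned by the resolver count down until it expires and the record is re-fetched.

```python
# pip install dnspython
# Sketch only: watch a recursive resolver's cached TTL tick down for a record.
import time
import dns.resolver

NAME = "my-important-container.westeurope.azurecontainer.io"  # example FQDN
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]  # any recursive resolver works

for _ in range(5):
    answer = resolver.resolve(NAME, "A")
    # The TTL returned by a cache decreases on every query; when it reaches 0
    # the resolver goes back to the authoritative server for a fresh copy.
    print(f"{answer[0]}  remaining TTL: {answer.rrset.ttl}s")
    time.sleep(30)
```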
The "Function app is up and running" phenomenon
This implies that Azure hosts Container Instances on a common serverless platform together with Azure Functions. That IP address is, at that moment, allocated to a serverless instance, but you have stopped/removed your container, so the underlying layer responds with a default placeholder message. It's a misleading response, because you are not actually using Functions, and your serverless workload is not actually 'up and running' at that time.
Microsoft could prevent this issue by injecting context information when creating a serverless instance. That way, the instance would know whether it is currently serving a container instance or a function, and could respond with a more informative placeholder message if configured correctly.

Setting up DNS records with Google Compute Instance

We are transferring a site from Rackspace to Google Compute. I have the instance I need and have been working with it. I want to set up the DNS zone and records in Google ahead of the actual move if possible as we have many CNAME records for dynamic sub-domains that I will need to key in (I've looked into automating the DNS record transfer but it does not transfer the sub-domain records). What records can I set up ahead of time?
Thank you,
Sally
You can set up DNS records in the GCP zone ahead of time, as long as they will not conflict with your existing records. Otherwise, it may be best to wait and do the full zone file migration, i.e., pick a time, do it all at once, and copy your current records over exactly.

Please take a look at the GCP documentation on migrating to Cloud DNS [1], which describes how to complete the necessary steps: creating a managed zone for your domain, importing your existing DNS configuration, and updating your registrar's name server records. Also, this document [2] provides a simple example of creating a managed zone and then setting up Address (A) and Canonical Name (CNAME) records for the domain.

As for subdomains, it really depends on whether they are resident in the main zone or not. If not, then you would also need to migrate the subdomain zones as well (in the same process). Hope this helps.
[1] https://cloud.google.com/dns/docs/migrating
[2] https://cloud.google.com/dns/docs/quickstart
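
If you'd rather script the pre-staging than key in many CNAME records by hand, the records can also be created through the Cloud DNS API. Below is a minimal sketch using the google-cloud-dns Python client with placeholder project, zone, and sub-domain names; treat it as an outline of the idea rather than a finished migration script.

```python
# pip install google-cloud-dns
# Sketch with placeholder names; assumes the google-cloud-dns client library.
from google.cloud import dns

client = dns.Client(project="my-project")           # placeholder project ID

# Create (or reference) the managed zone for the domain.
zone = client.zone("example-zone", "example.com.")  # zone name, DNS suffix (trailing dot)
if not zone.exists():
    zone.create()

# Stage CNAME records for the dynamic sub-domains before the cutover.
changes = zone.changes()
for sub in ("app1", "app2"):                         # placeholder sub-domains
    record = zone.resource_record_set(
        f"{sub}.example.com.", "CNAME", 300, ["target.example.com."]
    )
    changes.add_record_set(record)
changes.create()   # submits the change set to Cloud DNS
```

Nothing takes effect for visitors until you actually repoint the registrar's name servers, which is what makes staging the records ahead of the move safe.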

Browser failing to renegotiate DNS on persistent connection

I’m investigating a scenario with a live dashboard (Angular web app) that is refreshed every 5 seconds (polling). The API is sitting behind Azure Traffic Manager which will fail over to a second region in the event of a failure in the primary region. Keep in mind, Azure Traffic Manager works at the DNS level.
The problem I am facing is that the browser maintains a persistent connection to the primary region even after the Traffic Manager has failed over. The requests initially fail with 503s, but then continue to fail with 502s. The DNS lookup is never performed again as the requests occur more frequently than the keep-alive timeout. This causes the browser to continue to make requests to the failed region.
Is there any way to explicitly kill the connection to force a new DNS lookup? The only ways I've found so far are to stop making requests for 2 minutes, or to close and reopen the browser. Neither is an acceptable solution for a dashboard that is supposed to be hands-off and always fresh.
What's interesting is that after getting the browser to fail over to the secondary region, if I restart the primary region the browser will automatically switch back to the primary region after about a minute. This tells me the connection respects the DNS TTL when the service is functioning properly, but not when the server is unavailable. It makes no sense to me why the browser would lock onto a single IP forever when the endpoint can't be reached.
Is there something I am missing about implementing geo-redundant failover with Traffic Manager for a web application? It seems very odd that the user would have to stop making requests for 2 minutes before the browser would renegotiate the IP to the failed-over server. Is it expected that you must turn off keep-alive to truly support near-instant failover?
Here's a diagram that describes this scenario: [diagram omitted]
Generally, Azure Traffic Manager works at the DNS level. Clients connect to the service endpoint directly, not through Traffic Manager. Traffic Manager has no way to track individual clients and cannot implement 'sticky' sessions.
The performance impact of the initial DNS lookup is explained in the Traffic Manager documentation:
DNS name resolution is fast and results are cached. The speed of the initial DNS lookup depends on the DNS servers the client uses for name resolution. Typically, a client can complete a DNS lookup within ~50 ms. The results of the lookup are cached for the duration of the DNS Time-to-Live (TTL). The default TTL for Traffic Manager is 300 seconds.

The TTL value of each DNS record determines the duration of the cache. Shorter values result in faster cache expiry; longer values mean that it can take longer to direct traffic away from a failed endpoint. Traffic Manager allows you to configure the TTL as low as 0 seconds and as high as 2,147,483,647 seconds. You could choose the value that best balances the needs of your application.
Given the above, if you want failover to be picked up faster, you could set the TTL value as low as possible. Once a connection is established, clients stay persistently connected to the selected endpoint until the health check marks that endpoint as unhealthy.
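
If you do lower the TTL, the profile's DNS TTL can be changed in place without recreating anything. Below is a minimal sketch using the azure-mgmt-trafficmanager Python SDK with placeholder subscription and resource names; it illustrates the idea rather than recommending a specific value (the same setting is also exposed in the portal on the profile's Configuration page).

```python
# pip install azure-identity azure-mgmt-trafficmanager
# Sketch with placeholder subscription/resource names.
from azure.identity import DefaultAzureCredential
from azure.mgmt.trafficmanager import TrafficManagerManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "my-rg"                # placeholder
PROFILE_NAME = "mycompany"              # placeholder

client = TrafficManagerManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Fetch the existing profile, lower its DNS TTL, and push the update back.
profile = client.profiles.get(RESOURCE_GROUP, PROFILE_NAME)
profile.dns_config.ttl = 30   # seconds; lower TTL = faster failover, more DNS queries
client.profiles.create_or_update(RESOURCE_GROUP, PROFILE_NAME, profile)
```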
You can enable and disable Traffic Manager profiles and endpoints. However, a change in endpoint status might also occur as a result of Traffic Manager's automated settings and processes. Get more details here.
For the Geographic routing method:
The endpoint mapped to the geographic location of the DNS query's source IP is returned. If that endpoint is unavailable, another endpoint will not be selected for failover, since a geographic location can be mapped to only one endpoint in a profile (more details are in the FAQ). As a best practice, when using geographic routing, we recommend using nested Traffic Manager profiles, each with more than one endpoint, as the endpoints of the profile.

Microsoft Azure Traffic Manager

I've created an Azure Traffic Manager profile which uses failover as the load balancing method. The primary endpoint is an on-premises website, test.company.com. The other endpoint is an Azure Web App which has a custom domain name, xxx.mysite.com. When I added the endpoint to Traffic Manager, it pointed to mysite.azurewebsites.net.
I've created a CNAME record at the ISP to point xxx.mysite.com to mycompany.trafficmanager.net.
When I stop the primary website to simulate a failover to the second website I get Error 404 - Web App Not Found. If I go directly to mycompany.trafficmanager.net it works as expected and displays the xxx.mysite.com website.
What am I missing in the configuration so that when I failover it displays the xxx.mysite.com website?
Azure Traffic Manager is a DNS routing system, not a load balancer. Using DNS will always introduce latency when changes are made. By default, Traffic Manager uses a TTL of 300 seconds (5 minutes).
This means any clients (like web browsers) will only check for a new address every 5 minutes, and that's if they actually honor the TTL value and don't cache the DNS entry even longer. There are also lots of DNS proxies and caches (like at your ISP) that can still hold the old DNS entry. Any update will take minutes at least before clients go to the failover site.
You can lower the TTL, although this will increase the number of queries (and the resulting cost) and might decrease performance. If you absolutely can't have any downtime, then you'll have to look into running an actual load balancer that handles the traffic directly and sends it to the right place.
As of 2020, Azure now has the Front Door service which is a global load balancer that will handle the requests and failover seamlessly. Try that instead. More info here: https://azure.microsoft.com/en-us/services/frontdoor/
Can you check whether the custom domain is also added to the web app? E.g., something.mysite.com is registered as a custom hostname on mysite.azurewebsites.net.
If that step isn't done, then when the request is routed to the azurewebsites app it will fail, because there is nothing in the configuration to indicate that something.mysite.com really is mysite.azurewebsites.net.
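
A quick way to verify that binding is to list the hostnames currently attached to the Web App. Below is a minimal sketch using the azure-mgmt-web Python SDK with placeholder names; the app's 'Custom domains' blade in the portal shows the same list.

```python
# pip install azure-identity azure-mgmt-web
# Sketch with placeholder subscription/resource names.
from azure.identity import DefaultAzureCredential
from azure.mgmt.web import WebSiteManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "my-rg"               # placeholder
APP_NAME = "mysite"                    # placeholder: the azurewebsites.net app

client = WebSiteManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
site = client.web_apps.get(RESOURCE_GROUP, APP_NAME)

# If something.mysite.com isn't in this list, App Service has no binding for
# it and will answer requests routed to it with a 404 "Web App Not Found".
for hostname in site.host_names:
    print(hostname)
```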

Will increasing the TTL improve the availability of a website when the name server is down?

Suppose the registrar's name server goes down; then my website will not be available.
For sure they have a second name server.
But I don't want to rely on that one (in case of a serious outage at the registrar).
I'm not able to configure a second name server outside of the registrar.
So I was thinking: will increasing the TTL to, let's say, 24 hours decrease the dependency on the availability of the name server?
So, if the outage is less than 24 hours, and the TTL is 24 hours, will my website be available despite the name server outage?
How requests to your site actually work:
1. Your site is just an IP address.
2. Your site has a domain name, which is just sugar that helps users remember it.
3. When a program wants to access your site by domain name, it makes a system call to the OS (e.g. gethostbyname), which returns the site's IP address, and then the program continues.
4. The OS has a DNS cache, which means it queries name servers only when the cached entry has expired.
5. The OS does not query your site's name servers directly; it queries the resolvers preconfigured by the provider (unless the user has overridden that setting).
6. The provider's recursive resolver checks its own cache and, if the entry is expired or missing, queries the root servers.
7. The root servers refer the resolver to the servers for the domain zone (the TLD).
8. The TLD servers refer the resolver to your domain's authoritative name server (the one at your registrar), which returns the IP.
So, between steps 3 and 8, your site's IP address gets cached (in the OS cache and in the recursive resolver's cache), and it all depends on the cache TTL on those resolvers and on when the site was last accessed through them.
To answer your question: a TTL of 24 hours increases availability, but not significantly. It only helps clients whose resolvers already have the record cached; any resolver that queries for the first time (or after its cache has expired) during the outage will still fail.
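
If you want to estimate how much a long TTL would actually buy you, you can compare the configured TTL on your authoritative name server with the TTL remaining in a recursive resolver's cache. A minimal sketch with dnspython and placeholder domain/name-server names:

```python
# pip install dnspython
# Sketch with a placeholder domain: compare the authoritative TTL with the
# TTL remaining in a recursive resolver's cache. During a name server outage,
# a resolver can only keep answering from its cache for the remaining TTL.
import dns.resolver

NAME = "www.example.com"           # placeholder: your site
AUTH_NS = "ns1.registrar.example"  # placeholder: your registrar's name server

# Remaining TTL at a public recursive resolver (what most visitors depend on).
recursive = dns.resolver.Resolver(configure=False)
recursive.nameservers = ["8.8.8.8"]
cached = recursive.resolve(NAME, "A")
print(f"cached copy survives another {cached.rrset.ttl}s at this resolver")

# Full TTL straight from the authoritative server (the configured value).
authoritative = dns.resolver.Resolver(configure=False)
authoritative.nameservers = [str(dns.resolver.resolve(AUTH_NS, "A")[0])]
fresh = authoritative.resolve(NAME, "A")
print(f"configured TTL is {fresh.rrset.ttl}s")
```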
