Azure ACI not running but still serving up an Azure Functions page - azure

I deployed a new ACI using the option 'mcr.microsoft.com/azuredocs/aci-helloworld:latest (linux)' from the Azure portal. Once it's deployed and running, visiting the FQDN for the container loads up the page below. Makes sense.
However, if I stop the ACI instance and wait a few minutes, I get the following page for about the next 15 minutes (except mine says Functions 3.0). After those 15 minutes, I then get a DNS probe error message, which makes sense. If my ACI is stopped, why is there a function app responding to requests?

I can only speculate, but this may still be valuable information for you.
The 15-minute gap
The 15-minute gap sounds very much like DNS caching. When I deploy a container instance in the West Europe region with the hostname "my-important-container" and a public IP, I get a publicly available DNS record for it like this:
my-important-container.westeurope.azurecontainer.io
In this case, the DNS record is created for you by the Azure platform. Microsoft engineers have probably set 15 minutes of caching as the default value.
When you create a DNS record by hand, you can specify the number of seconds (the TTL) for which it will be cached in the global network of DNS servers, so that they don't have to resolve it against the authoritative server every single time someone uses that name to access a web service. A 15-minute cache means the authoritative server only has to serve 1 request instead of 1,000 if there are 1,000 lookups for a website within a 15-minute window (from the same area, using the same non-authoritative resolver).
If you want to experiment with DNS caching, it is very easy using Azure. For example, you can use Azure DNS Zones, or if you don't want to buy a domain, you can use Azure Private DNS Zones on a private VNet and see how caching works.
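If you want to see the TTL the platform actually sets, you can query the record yourself. Below is a minimal sketch assuming the dnspython package; the hostname is the example name from above, not a real deployment.

```python
# Minimal TTL check (assumes: pip install dnspython).
# The hostname is the example container name from this answer.
import dns.resolver

name = "my-important-container.westeurope.azurecontainer.io"
answer = dns.resolver.resolve(name, "A")

# The TTL tells you how long resolvers may cache this record
# before asking the authoritative server again.
print("resolved to:", [rr.address for rr in answer])
print("TTL (seconds):", answer.rrset.ttl)
```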
The "Function app is up and running" phenomenon
This implies that Azure hosts Container Instances on a common serverless platform together with Azure Functions. The IP address is still allocated to a serverless instance at that point, but you have stopped or removed your container, so the underlying layer responds with a default placeholder message. It is a misleading response, because you are not actually using Functions, and your serverless workload is not actually 'up and running' at that time.
Microsoft could prevent this by injecting context information when creating a serverless instance. That way, the instance would know whether it is currently serving a container instance or a function, and could respond with a more informative placeholder message.
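If you want to poke at this yourself, you can fetch the FQDN while the container is stopped and look at what actually answers. A rough sketch assuming the requests package; the hostname is again the example name, and whatever headers come back depend on what Azure's front end really is.

```python
# Rough diagnostic: see what responds on the FQDN while the container is stopped.
# Hostname is a placeholder for your own *.azurecontainer.io name.
import requests

url = "http://my-important-container.westeurope.azurecontainer.io/"
resp = requests.get(url, timeout=10)

print(resp.status_code)
print(resp.headers.get("Server"))  # may hint at the layer that replied
print(resp.text[:200])             # placeholder page vs. your own app
```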

Related

High concurrency system on Google App Engine

Here is my situation.
I have a project hosted on Google Cloud, more specifically GAE (NodeJS) and Firestore.
I have a queue stored on Firestore that can hold up to 30-40k entries.
Each entry is basically an object for which I'll have to make an API call to an external service.
That external service allows only 10 requests/s per IP.
At the moment I take batches of 10 and make an API call for each one, but it's too slow.
I already tried running multiple instances of the GAE service, but I still hit the limit (do the instances use the same IP?!).
Another option would be to move the API call into a Cloud Function and make it from there, but I think I would get the same outcome as with the GAE instances.
So, what do you think ?
Many thanks!
In my opinion, the requests-per-second-per-IP limit is in place to throttle the overall amount of incoming requests, and gaming this rule may cause issues for that service. The best way to handle this situation is either to get a paid subscription or to discuss the issue directly with the service provider.
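If you do stay within the published limit rather than trying to work around it, a simple client-side pacer keeps you at 10 requests per second without bursting. This is only a sketch (in Python, but the idea carries over to Node); API_URL and call_external_api are placeholders, not the real external service.

```python
# Client-side pacing sketch: never launch more than RATE calls per second.
# API_URL and the payload shape are placeholders for the real service.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "https://external-service.example.com/endpoint"  # placeholder
RATE = 10                # allowed requests per second per IP
INTERVAL = 1.0 / RATE    # spacing between request launches

def call_external_api(entry):
    # One call per queue entry; adjust verb/payload to the real API.
    return requests.post(API_URL, json=entry, timeout=30).status_code

def process_queue(entries):
    with ThreadPoolExecutor(max_workers=RATE) as pool:
        futures = []
        for entry in entries:
            futures.append(pool.submit(call_external_api, entry))
            time.sleep(INTERVAL)  # stagger launches: ~10 per second
        return [f.result() for f in futures]
```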
Regarding the App Engine instances and IP addresses, the short answer is:
No, GAE instances don't have their own dynamic IPs.
For reference, you can confirm this in the FAQ for App Engine:
App Engine does not currently provide a way to map static IP addresses to an application. In order to optimize the network path between an end user and an App Engine application, end users on different ISPs or geographic locations might use different IP addresses to access the same App Engine application. DNS might return different IP addresses to access App Engine over time or from different network locations.
A tcptraceroute to a Google service shows one of these points:
lga34s14-in-f14.1e100.net
According to the description of Google Edge Network:
Our Edge Points of Presence (PoPs) are where we connect Google's network to the rest of the internet via peering. We are present on over 90 internet exchanges and at over 100 interconnection facilities around the world.
To sum it up: your application should exit Google's network from the edge point closest to its target, so it would make sense that it's always the same point with the same IP; and given the number of services and client applications GCP hosts, you can expect Google to be using a reverse proxy.
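If you want to confirm that your instances share an egress IP, you can have each instance ask an external echo service which address it appears as and compare the results. A small sketch assuming the requests package; api.ipify.org is just one example of such an echo service.

```python
# Quick check of the public egress IP as seen from outside.
# api.ipify.org is one example of an "echo my IP" endpoint.
import requests

def log_egress_ip():
    ip = requests.get("https://api.ipify.org", timeout=10).text
    print("Outbound requests from this instance appear to come from:", ip)

if __name__ == "__main__":
    log_egress_ip()
```

Run it from each instance (or log it on startup); if every instance prints the same address, the shared-IP explanation above is confirmed.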

Microsoft Azure Traffic Manager

I've created an Azure Traffic Manager profile which uses failover as the load balancing method. The primary endpoint is an on-premises website, test.company.com. The other endpoint is an Azure Web App which has the custom domain name xxx.mysite.com. When I added the endpoint to Traffic Manager, it pointed to mysite.azurewebsites.net.
I've created a CNAME record at the ISP to point xxx.mysite.com to mycompany.trafficmanager.net.
When I stop the primary website to simulate a failover to the second website, I get Error 404 - Web App Not Found. If I go directly to mycompany.trafficmanager.net, it works as expected and displays the xxx.mysite.com website.
What am I missing in the configuration so that when I failover it displays the xxx.mysite.com website?
Azure Traffic Manager is a DNS routing system, not a load balancer. Using DNS means changes always take time to propagate. By default, Traffic Manager uses a TTL of 300 seconds, which is 5 minutes.
This means any clients (like web browsers) will only check for a new address every 5 minutes, and that's if they actually follow the TTL value and don't cache the DNS entry even longer. There are also lots of DNS proxies and caches (like your ISP's) that can still hold the old DNS entry. Any update will take minutes at the very least before clients go to the failover site.
You can lower the TTL, although this will increase the number of queries (and the resulting cost) and might decrease performance. If you absolutely can't have any downtime, then you'll have to look into running an actual load balancer that handles the traffic directly and sends it to the right place.
As of 2020, Azure also has the Front Door service, which is a global load balancer that handles requests and fails over seamlessly. Try that instead. More info here: https://azure.microsoft.com/en-us/services/frontdoor/
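To watch the DNS behaviour described above, you can resolve the Traffic Manager name directly and see which endpoint and TTL come back before and after stopping the primary. A sketch assuming the dnspython package; mycompany.trafficmanager.net is the profile name from the question.

```python
# Resolve the Traffic Manager name and show where it currently points.
# Assumes: pip install dnspython. Profile name taken from the question.
import dns.resolver

NAME = "mycompany.trafficmanager.net"

def show_current_answer():
    answer = dns.resolver.resolve(NAME, "CNAME")
    for rr in answer:
        # Traffic Manager answers with a CNAME to whichever endpoint it
        # currently considers healthy; the TTL is how long clients may cache it.
        print("pointing at:", rr.target, "TTL:", answer.rrset.ttl)

# Run once, stop the primary endpoint, wait for the TTL to expire,
# then run again and the answer should switch to the failover endpoint.
show_current_answer()
```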
Can you check whether the custom domain is also added to the web app? E.g. that something.mysite.com is registered as a custom hostname on mysite.azurewebsites.net.
If that step isn't done, then when the request is routed to the azurewebsites app it will fail, because there is nothing configuration-wise to indicate that something.mysite.com is really mysite.azurewebsites.net.
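A quick way to test that binding without touching DNS is to send a request to the azurewebsites.net front end with the custom hostname in the Host header; if the hostname isn't registered on the web app, you should get the same 404 "Web App Not Found" page. A sketch assuming the requests package, using the hostnames from the question.

```python
# Check whether the custom hostname is bound to the web app by overriding
# the Host header. Hostnames are the ones from the question.
import requests

resp = requests.get(
    "http://mysite.azurewebsites.net/",
    headers={"Host": "xxx.mysite.com"},  # the hostname Traffic Manager hands out
    timeout=10,
)

# A 404 "Web App Not Found" here means the custom hostname is not registered
# on the web app, which matches the failover symptom in the question.
print(resp.status_code)
print(resp.text[:200])
```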

automatic failover if webserver is down (SRV / additional A-record / ?)

I am starting to develop a webservice that will be hosted in the cloud but needs higher availability than typical cloud SLAs provide.
Typical SLAs, e.g. Windows Azure, promise an availability of 99.9%, i.e. up to 43min downtime per month. I am looking for an order of magnitude better availability (<5min down time per month). While I can configure several load balanced database back-ends to resolve that part of the issue I see a bottleneck at the webserver. If the webserver fails, the whole service is unavailable to the customer. What are the options of reducing that risk without introducing another possible single point of failure? I see the following solutions and drawbacks to each:
SRV-record:
I duplicate the whole infrastructure (and take care that the databases are in sync) and add additional SRV records for the domain, so that a user trying to access www.example.com will automatically get forwarded to example.cloud1.com or, if that one is offline, to example.cloud2.com. Googling around, it seems that SRV records are not supported by any major browser; is that true?
second A-record:
Add an additional A-record as an alternative. Drawbacks:
a) At my hosting provider I do not see any option to add a second A-record, just one... is that normal?
b) If one of the two servers is down, I am not sure whether the user gets automatically redirected to the other one, or whether 50% of all users get a 404 or some other error.
Any clues about best practice would be appreciated.
Cheers,
Sebastian
The availability of the instance, i.e. the SLA specified by the cloud provider, means "the instance is healthy and the server is running in the context of the hypervisor or fabric controller". With that said, you need to make an effort to ensure the instance is not failing because of your app, the OS, or pretty much anything else running inside the instance. There are a few things which devops teams tend to miss and which hit back hard, for instance forgetting to configure OS updates and patches.
The fundamental axiom of availability is redundancy: the more redundant your application and infrastructure are, the more available your app is.
I recommend you look into Azure Traffic Manager and then rework your architecture. You need not worry about the SRV record or A-record; just a CNAME pointing to the Traffic Manager would do the trick.
The idea of Traffic Manager is simple: you tell Traffic Manager to sit behind the domain name (the domain-name resolution of the app), and then Traffic Manager decides where to send each request based on factors like round-robin routing, disaster management, etc.
With the combination of Traffic Manager and a multi-region infrastructure setup, you will march towards your high-availability goal.
Links
Azure Traffic Manager Overview
Cloud Power: How to scale Azure Websites globally with Traffic Manager
Maybe you should configure a Corosync cluster with DRBD?
DRBD will ensure that the data on both nodes is replicated (for example, website files and DB files).
Apache as the web server will be available under a virtual IP to which the domain points. If one server goes down, Corosync will move all services to the second server within a few seconds.

Delete Azure VM Instance from load balanced Cloud Service

I have 2 Azure VMs (Linux) being load balanced by a public Azure Cloud Service. Both instances show in the Azure Management portal for the same cloud service. I want to take down one instance and perform some maintenance. However, since the instance still shows even though the VM has been shut down, the Cloud Service is still directing traffic to it. How do I delete an instance from the Cloud Service, or stop the Cloud Service from directing traffic to a particular VM instance? And afterwards, how does one re-associate an existing VM with that service (i.e. change from one Cloud Service to another)?
Note: SSH into the VM works, but other ports used by the VM are not working; it acts like the traffic is trying to go to the other VM, even though the correct endpoints are created for the active VM.
The purpose of a port probe in a load-balanced set is for the load balancer to be able to detect whether or not a VM is able to accept traffic. When configuring the load-balanced endpoint you can specify a webpage or a TCP endpoint for the probe - and this should be present on each instance. Traffic will be directed to the VM as long as the webpage returns 200 OK or the TCP endpoint accepts the connection when the load balancer probes. You can specify the time interval between probes and the number of probes that must fail before the endpoint is deemed dead and should be taken out of rotation (defaults are every 15 seconds and 2 probes).
You can take a VM out of load-balancer rotation by ensuring that the configured probe page returns something other than 200 OK and then bring it back into rotation by having it once again send a 200 OK.
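One way to implement that is to have the probe page consult a "maintenance" flag and return a non-200 status while you work on the machine. Below is a minimal sketch using Python's standard library; the /probe path, port, and flag-file location are placeholders for whatever your load-balanced endpoint's probe is actually configured to hit.

```python
# Minimal probe endpoint: returns 200 OK normally, 503 while a maintenance
# flag file exists, so the load balancer pulls the VM out of rotation.
# Path, port, and flag file are placeholders.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

FLAG_FILE = "/tmp/maintenance"

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/probe":
            status = 503 if os.path.exists(FLAG_FILE) else 200
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"maintenance" if status == 503 else b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ProbeHandler).serve_forever()
```

Create the flag file before maintenance, wait for the configured number of failed probes so the VM drops out of rotation, do the work, then delete the file to come back in.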
When I have needed to keep my web service running and returning a 200 status, I have had to resort to removing the endpoint from the load-balanced set. It is pretty simple to do, but it usually takes a minute for the web portal to remove the endpoint, and the same again once you recreate the endpoint to put it back in the set.

Azure VM blocks page requests from specific ip

We have a VM with the following DNS on Azure:
erpone-jsl.cloudapp.net
Frequently the Default Website on this VM becomes inaccessible with the error message "this web page cannot be displayed". However, this happens only for users of a particular internet service provider in the western part of India.
If this particular ISP resets its server, the site becomes accessible.
OR if we restart this particular VM, the site becomes accessible.
But the problem recurs after a few hours or few days.
We noticed that the issue recurs when Windows 2012 (Datacenter) updates itself on the VM - but we are not sure of this, yet.
The IP pool from where this problem occurs is 116.199.168.0 to 116.199.168.21
This ISP is telling us that their IPs are being blocked by the Azure VM firewall, but we have not blocked or restricted any IPs in our VM or IIS.
Can someone throw light on this strange phenomenon?
Page requests coming from only this range of IPs are unable to access the website, but the problem resolves temporarily when either the VM or the ISP's server is restarted.
Not sure what's happening in this particular case, but... Windows Azure doesn't offer a per-cloud-service firewall that provides IP-blocking. This is done at the VM level (your VMs).
That said: Have you tried accessing any other Windows Azure cloud services from those IP addresses which are having trouble accessing erpone-jsl.cloudapp.net? This would be a good test to perform, with test web applications in the same data center as erpone-jsl as well as one in another data center. You can also try putting a test website up in Azure Web Sites, free tier, which will let you deploy something in just a few minutes, with no cost, for testing purposes. You can use ftp to download diagnostic logs from Web Sites.
One thing you can do is inspect the IIS logs (on Virtual Machines, or on the web/worker roles of Cloud Services) to see if the offending IP addresses show up (and I'd suggest checking the logs of the test websites' cloudapp.net endpoints as well).
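If you want to check the logs for that specific range (116.199.168.0 to 116.199.168.21), a small script can scan the W3C log files for the client IP column. A sketch only: the log directory is the IIS default and the field names assume the standard W3C layout, so adjust both to your server.

```python
# Scan IIS W3C logs for requests from the 116.199.168.0-21 range
# mentioned in the question. Log path and field layout are IIS defaults;
# adjust to your configuration.
import glob
import ipaddress

LOG_GLOB = r"C:\inetpub\logs\LogFiles\W3SVC1\*.log"
RANGE_START = ipaddress.ip_address("116.199.168.0")
RANGE_END = ipaddress.ip_address("116.199.168.21")

for path in glob.glob(LOG_GLOB):
    with open(path, encoding="utf-8", errors="ignore") as log:
        fields = []
        for line in log:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]   # remember the column layout
                continue
            if line.startswith("#") or not fields:
                continue
            values = dict(zip(fields, line.split()))
            client = values.get("c-ip")
            if not client:
                continue
            try:
                addr = ipaddress.ip_address(client)
            except ValueError:
                continue
            if RANGE_START <= addr <= RANGE_END:
                print(path, values.get("date"), values.get("time"),
                      client, values.get("cs-uri-stem"), values.get("sc-status"))
```

If the range never appears in the logs, the requests are being dropped before they reach IIS, which would point away from anything configured on the VM itself.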
If all else fails, you can then open a support ticket, to see if Azure support has any light to shed on this.
