Azure VM outbound HTTP is unreliable - azure

I have setup and Azure VM and installed a monitoring service that reaches out to various endpoints to verify a 200 response. The service is set to cycle through about 8 URL endpoints every 5 minutes or so.
We have run this service from multiple other servers outside of Azure, including virtual machines that are cheap, low end offerings.
While this machine is running on the lowest A0, it isn't doing anything else other than to run this service and call out to the various endpoints.
We are getting intermittent periods where one of the calls out of the list will fail for different periods that span 10-40 minutes at random periods several times a day.
The site or sites that fail are totally random and there is no down time from other monitor locations. We are sure that the connection problem is between Azure and the endpoints outside of Azure. There is no problem from anywhere outside of Azure.
I'm trying to figure out what could be causing this issue. It concerns me because we will be adding more services to Azure soon that use outside HTTP calls for credit card authorization and other API's.
Is this a known issue where outbound calls just don't function reliably at periods, or am I missing something in the setup or security settings?
Obviously, if the call makes it out and the response doesn't make it back, that is even worse as credit card charges would end up being pushed and the application would not register the proper response.
Anyone with some experience or insight would be greatly appreciated.
Thanks!

I find that very disturbing and hard to believe since, among a lot of other stuff, I run a service like that too... In my case I reach out to several (today, about 70) external addresses on both IPV4 and IPV6. I don't run A0s, and most of my machines are A3. I'll start a A0 to test it... if anything turns out <terminator>i'll be back</terminator> to report...
I know that there are several limitations regarding network traffic but i don't think you can reach them the way you're reporting...
My suggestion is to report that problem directly to MS via support ticket... most likely the problem is on the other side...

Related

AspNet Core on Azure App Service peaks to 100% CPU and app gateway load balancer not working

We have few of our internal business services hosted on an isolated ASE in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production and use for little more than a month now and has been performing fairly well apart from the occasional sudden CPU spike to 100% in one of the instance which bring down the services.
We don't have auto scaling setup but have 2 instances running all the time.
The services are `aspnetcore` webapi and the runtime is dotnet core 2.0.
Every time I have come across this issue in the last couple of weeks I have not been lucky enough to login to kudu and get a process dump to investigate further. The business are literally behind my back to get the service up and running as quick as possible and the easiest route is to restart one of the faulting service or swap slots with a pre-prod environment.
Access to the ASE are also restricted from our network and makes it all the more difficult for me to switch to a WiFi and then go through jump boxes to login to kudu, I had asked our Ops engineer to get me the dump when this issue is reported but he has not been listening to me either, mostly for the same reasons as me not able to do it myself.
All exceptions I can see in Application Insights are due to the service themselves going down and there are no exceptions there which can cause the issue in the first place(at least I've not found it yet)
This lead me to take few guess and look for metrics, the only thing raising my
suspicions is garbage collection. I don't see any sudden spike in GC graphs as well, each time the service is re started the graph is fairly a straight line(24 hours) but increases day by day and ends up like below.
But the working memory is a sinusoid graph letting me think there are no memory leaks. But is the above graph over 3 days normal?
The drop is when I restart the service. But all services have a similar trajectory even the one that has not gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a service bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that there are no unmanaged resources and the DB connections are always wrapped in `using` statements.
I understand I would need a process dump to investigate further but my biggest concern is why is the application gateway(load balancer) not sending traffic to the healthy instance. Because of the gateway going unhealthy cloudflare returns a `502` response to clients using the api.
MS support haven't been able to help and have not answered if we have our load balancers working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart this sudden surge.
Thanks
It could be that the backend is pegged at 100% CPU and is unable to respond to Application Gateway health probes. When such an issue occurs, were you able to verify, using Backend health logs, the health state of your backends? If both backend instances were unhealthy, it would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case then please reply back with subscription id, gateway name and approximate time window of incident for us to take a look.

What does the Azure Web Apps architecture look like?

I've had a few outages of 10 to 15 minutes, because apparently Microsoft had a 'blip' on their storages. They told me that it is because of a shared file system between the instances (making it a single point of failure?)
I didn't understand it and asked how file share is involved, because I would assume a really dumb stateless IIS app that communicates with SQL Azure for its data.
I would assume the situation below:
This is their reply to my question (I didn't include the drawing)
The file shares are not necessarily for your web app to communicate to
another resources but they are on our end where the app content
resides on. That is what we meant when we suggested that about storage
being unavailable on our file servers. The reason the restarts would
be triggered for your app that is on both the instances is because the
resources are shared, the underlying storage would be the same for
both the instances. That’s the reason if it goes down on one, the
other would also follow eventually. If you really want the
availability of the app to be improved, you can always use a traffic
manager. However, there is no guarantee that even with traffic manager
in place, the app doesn’t go down but it improves overall availability
of your app. Also we have recently rolled out an update to production
that should take care of restarts caused by storage blips ideally, but
for this feature to be kicked it you need to make sure that there is
ample amount of memory needs to be available in the cases where this
feature needs to kick in. We have couple of options that you can have
set up in order to avoid any unexpected restarts of the app because of
a storage blip on our end:
You can evaluate if you want to move to a bigger instance so that
we might have enough memory for the overlap recycling feature to be
kicked in.
If you don’t want to move to a bigger instance, you can always use
local cache feature as outlined by us in our earlier email.
Because of the time differences the communication takes ages. Can anyone tell me what is wrong in my thinking?
The only thing that I think of is that when you've enabled two instances, they run on the same physical server. But that makes really little sense to me.
I have two instances one core, 1.75 GB memory.
My presumption for App Service Plans was that they were automatically split into availability sets (see below for a brief description) Largely based on Web Apps sales spiel which states
App Service provides availability and automatic scale on a global data centre infrastructure. Easily scale applications up or down on demand, and get high availability within and across different geographical regions.
Following on from David Ebbo's answer and comments, the underlying architecture of Web apps appears to be that the VM's themselves are separated into availability sets. However all of the instances use the same fileserver to share the underlying disk space. This file server being a significant single point of failure.
To mitigate this Azure have created the WEBSITE_LOCAL_CACHE_OPTION which will cache the contents of the file server onto the individual Web App instances. Using caching in lieu of solid, high availability engineering principles.
The problem here is that as a customer we have no visibility into this issue, we've no idea if there is a plan to fix it, or if or when it will ever be fixed since it seems unlikely that Azure is going to issue a document that admits to how badly this has been engineered, even if it is to say that it is fixed.
I also can't imagine that this issue would be any different between ASM and ARM. It seems exceptionally unlikely that there was originally a high availability solution at the backend that they scrapped when ARM came along. So it is very likely that cloud services would suffer the exact same issue.
The small upside is that now that we know this is an issue, one possible solution would be to deploy multiple web apps and have a traffic manager between them. Even if they are in the same region, different apps should have different backend file servers.
My first action would be to reply to that email, with a link to the Web Apps page, (and this question) with a copy of the quote and ask how to enable high availability within a geographic region.
After that you'll likely need to rearchitect your solution!
Availability sets
For virtual machines Azure will let you specify an availability set. An availability set will automatically split VMs into separate update and fault domains. Meaning that servers will end up in different server racks, and those server racks won't get updates at the same time. (it is a little more complex than that, but that's the basics!)
Azure Web Apps do used a shared file storage. The best way to think about it is that all the instances of your app map to the same network share that have your files. So if you modify the files by any mean (e.g. FTP, msdeploy, git, ...), all the instances instantly get the new files (since there is only one set of files).
And to answer your final question, each instance does run on a separate VM.

Syncing clocks on multiple Azure VMs

I have a requirement to write a load test measuring message transmission latencies. In order to simulate a large number of simultaneous uses without running into thread contention problem on one box, I'm spinning up multiple servers in Azure.
When I got my first results back, I was a little shocked to see that the results indicated the message was received before it was sent. I immediately realized that, while I had an implicit assumption that all the VMs would have their clocks synced to within milliseconds, that was clearly not the case.
I've spent several hours googling ways to resolve this, and I'm not getting anywhere. One thought was to have each VM query the time on a central server using NetRemoteTOD() using a technique similar to this NetRemoteTOD, and then establish a per-machine correction factor to be added to the time measured from the local machine's clock. However when I tried to run that method, I got a error 2184, "The service has not been started" I have verified that both the RPC service and the Windows Time service are running on the both the client and target machines, and I have not been successful in finding any information indicating what other service needs to be running (or even if the error really means what it seems to mean). (I also get the same error when running between my development desktop and a server on our corporate network. However, I can run it successfully to a PDC on the corporate network - but I can't find a PDC on Azure, since neither machine is part of a domain.)
So, does any one have either any information on what service needs to be started to get NetRemoteTOD (or the windows NET TIME command, which relies on NetRemoteTOD under the covers) working. Alternatively, does anyone have a suggestion for some other technique to get a consistent time reference across multiple VMs in Azure? (Note, I don't necessarily need their clocks synced, I just need a way to establish a consistent correction factor to reference the times to a common source. Note also, I need sub-second accuracy - probably about 100 msec will do.) Basically, I just need a windows function or shell command that will get me the time to sub-second accuracy on a given remote server.
Thanks in advance.
PS. Azure servers are running Server 2008 R2 SP1

Impossible to do a flawless Azure VIP swap without any SQL errors

When the site is being used by already a low amount of +5 users and I do a VIP swap I always get SQL errors, like:
System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: Session Provider, error: 19 - Physical connection is not usable)
I am using the SqlAzureExecutionStrategy. Also other requests during the VIP swap seem to take around ~30 seconds making the site really not responsive, despite that the staging environment is already fully warmed up.
Is there any way to prevent this?
Why don't I read much about this behaviour? Do others just don't care for a few broken requests and the slow 30s timeframe, or am I missing something essential in my Sql / EF settings?
You are literally swapping out live code while a site is actively in production. You are definitely going to get a variety errors. There is always going to be a split second where the services do not exist. Same thing when you deploy a new sql schema.
What you need to do is realize that this is going to happen and engineer around it. A strong dev ops team really shines here. What I personally do is to do staged deployments in different geographies. I will have traffic manager point all traffic over to the west coast web servers while I upgrade east coast. Then I point all traffic to east coast while I update west coast. I then set traffic management back to normal.
To allocate traffic from east to west, typically a single endpoint is set up with Traffic manager, which points behind the scenes to the correct endpoint, acting like a proxy. Just remove the endpoint you are about to upgrade, which will force a redirect to a live site. When the upgrade is done, add the endpoint back in.
You should also ensure that your code handles errors well. For example if needing to sync with external systems and during an upgrade a single record is entered into your system but not the external, can the external system's state be rebuilt with your data, or vice versa.
I finally took the step to switch to Azure Web Apps instead of Cloud Services and all these problems have been solved, and even more.
It took a few days of time but it's definitely worth it to switch.
Paying for accidently leaving a Staging server on is also a problem of the past.

Azure server not responding but dashboard reporting everything is running?

Very suddenly without any changes or recent access my Azure virtual server is no longer available for RDP or web...I have logged into the azure control panel and everything appears to running without issue but it is not working.
I have checked the end points and they are present for both RDP and Web, totally weird.
I have 2 virtual servers and the other one is working fine and responding.
Anyone ever experience this? Just when my client wants to view his website as well...
http://cn-web-02.cloudapp.net is the URL
TIA
As I just answered for this question, Virtual Machines are in Preview and not in Production yet. There are several reasons why your Virtual Machines became unavailable (see other answer). Given that this is the second reported incident here today, it's a good guess it's related to the underlying Host OS being updated, which would take your Virtual Machine offline for a short period of time.
I tried your URL and it's available again. Just remember about this being in Preview, especially since you mention having a client that wants to view his website. If you put a production website in Virtual Machines, then you'll have to absorb the risk of not having an SLA.
Having said that: You can mitigate downtime risk by running two Virtual Machines, listening on a load-balanced input endpoint. Be sure to have both Virtual Machines in the same Availability Set. Doing that ensures that the Windows Azure fabric controller will not take both Virtual Machines offline at the same time when doing things like Host OS updates. If this were in Production, you'd then have a very high availability scenario. Even in Preview, you'll improve availability by taking advantage of Availability Sets. Note: You'll need to use some type of shared session cache, since visitors will now be sent to either one of your Virtual Machines.
I had same experience on it! We had 2 instances and all of its were re-imaged without any notified. I known it since we made some local change via RDP.
Reboot or Reimage may help! You may try!
Turns out it was an outage from Microsoft...for over 22 hours but everything is back up and running. This is the 2nd time in 6 months this has happened for long stretches...makes me a little nervous to say the least.
Thanks for the input everyone and for anyone that's interested MS have a good site that tracks the service levels on Azure. Windows Azure Service Dashboard
S

Resources