App Service unavailable for unknown reason - azure

We're running App Service for Linux in docker container.
When things work, they work really good. But, occasionally, our site becomes unavailable for unclear reason.
Our health status reports looks like this:
Now, after some time, the app becomes completely unavailable. Health check reports Available, but in out docker log we find records like this:
2017-11-18 08:01:50.060 ERROR - Container for --- site ---is unhealthy. Stopping site.
2017-11-18 08:32:49.295 INFO - Issuing docker login to sever: http://---
2017-11-18 08:32:49.837 INFO - docker login to http://--- succeeded
2017-11-18 08:32:49.858 INFO - Issuing docker pull ---
2017-11-18 08:39:49.096 INFO - docker pull returned STDOUT>> 40: Pulling from ---
The only thing that helps is restarting the app. Then it comes back to normal and all works as expected.
I emphasise, site doesn't hang on every 'Unavailable' report from the Health check. It hangs randomly. CPU/Memory are at normal levels, nothing unusual there and no crasy spikes.
Application itself has general exceptions filter and no uncaught exceptions go out of app.
Any ideas why it might happen?

Depending on the site of your docker image, the application goes offline while it's pulling and initializing the new image. I noticed that our deploy took nearly 20 minutes before coming back up.

Related

My "spring boot application' cant be accessed but cpu and heap usage is normal

Background:
I have a kubernates cluster in my cloud server,And i deployed spring boot application as a pod .Everything is ok before 2 week , but my application unexpectedly could not be accessed on Ferb 22.
I used command "kubectl exec -it <pod> sh ",curl 127.0.0.1:<port> in pods but no response .I saw my application logs but cant found error about this issue.I try to restart my applications ,but the same issue occur after two days.
I have no idea with this issues.Can any one help me ?
When everything ok ,i can call 127.0.0.1:18890 and have response immediately ,once issues happend it will be requested timeout .Only this kubernates service have this question ,others seems normal

Azure batch pool crashing after running steadily for some time

I am encountering the following behavior with Azure Batch. I am using Shipyard to start a pool of 500 low-priority nodes to perform a list of 400.000 tasks. The pool size is managed using auto-scaling.
At first, the pool seems to be running just fine. The number of nodes increases to maximum capacity and the tasks complete as expected. However, after some time (having completed a sizable amount of tasks), I start to encounter 'start task failed' errors. The pool then quickly starts degrading until all nodes crash due to this same error.
This is the error I get in the stdout.txt file of one of the crashed nodes:
Login Succeeded
2020-03-04T09:09:07UTC - INFO - Docker registry logins completed.
2020-03-04T09:09:07UTC - WARNING - No Singularity registry servers found.
2020-03-04T09:13:37,840996225+00:00 - ERROR - Cascade Docker exited with non-zero exit code: 1
This seems to be an issue related to pulling the Docker image? Although it worked without issue on other nodes before.
I am aware that this is not a lot of information to go on, but I am having trouble figuring out what information is relevant and what's not.
UPDATE
After updating to shipyard 3.9.1, this is the output in stdout.txt for one of the crashed nodes (start task failed):
2020-03-05T08:23:43,784166638+00:00 - DEBUG - Pulling Docker Image: mcr.microsoft.com/azure-batch/shipyard:3.9.1-cargo (fallback: 0)
2020-03-05T08:23:58,876629647+00:00 - ERROR - Error response from daemon: Get https://mcr.microsoft.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-03-05T08:23:58,878254953+00:00 - ERROR - No fallback registry specified, terminating
Please see the GitHub issue https://github.com/Azure/batch-shipyard/issues/340. You will likely need to upgrade your Batch Shipyard version and recreate your pool.

Container stuck with no status and no events

I am using azure container instance for my services, but after some days of working the containers getting stuck and azure can not find them (return 404).
After the problem happen the service still has a public IP but nothing else.
All metrics show none usage and in the event tab there is an error that say "no events in the last hours".
I can not even connect to the container because it is say that the container does not exist.
The containers status contain nothing (not stopped, just dots like there is not value) and I can not start the containers but can stop them (witch is wired because azure complain that the does not exist).
After I stop the containers all the data (status and events) show the values of the last manual stop.
I have tried to run the docker in a regular server with out any problems. I also deny resources problem by monitor the resource usage.

Azure Service Fabric Activation Error 7148

I have a service fabric cluster which hosts numerous applications. One of the applications has a service type where the service is created, runs for a bit, and then is deleted. Everything works great, but the cluster virtually always has its state set to error because there will be a few of these in the "Unhealthy evaluations" section.
Error event: SourceId='System.Hosting', Property='CodePackageActivation:Code:EntryPoint'.
There was an error during CodePackage activation.The service host terminated with exit code:7148
I've wrapped both the program's main and RunAsync in exception handlers, but never see anything in analytics. Is there any way to look up what exit code 7148 means? Thanks.
7148 is a general error code that indicates that something failed in SF in the process of setting up or activating your service's host process. So that's the reason that you're not seeing any errors or exceptions - your code is never getting a chance to run.
Examples of things I've seen that led to 7148:
The exe was not actually a windows exe due to corruption
The service's manifest had a reference to a cert or some other pre-req like an endpoint that was incorrectly configured (like a port that was already in use or the wrong thumbprint for a cert)
Something blew up inside Windows that cause the process creation to fail, like a failure to correctly configure host networking for a container
Most of the times when I see this I have to look at the windows error logs to see what's really happening. The SF folks are also trying to capture more common causes of failures and reporting them as better health errors rather than relying on 7148.

Slow response times from free web app server every day at same time

Every day at about 3:00PM-4:00PM GMT the response times start to increase (no memory increase or CPU increase)
There is a azure availability test going to server every 10 minutes.
As this is a dev site there is no traffic to it other than me (at the odd time) and the availability test
I log to a variable internally the startup time and this shows that the site is not restarting
The first request via a browser when this starts happening is very slow (2 minutes - probably some timeout).
After that it runs perfectly. That seems like the site is shutting down and then starting up on first request, but the pings are keeping it alive so the site is not shutting down (as far as I know)
On the odd log entry I get - I seem to be getting 502 errors - but I can't confirm this as the FEEB logs are usually off at this time.
FREB logs turn off automatically after 1 hour and as this is the middle of the night for me (NZDT) - I don't get a chance to turn on.
See attached images - as you can see the response times just increase at same time
Ignore the requests where they are above 20 - thats me going to it via browser
I always check the azure dashboard BEFORE viewing site in browser
Just got this error (from web browser randomly - keep accessing the same page:
502: The specified CGI application encountered an error and the server terminated the process.
Other relevant Info (Perhaps):
I initially had the availability test ping going to a ping endpoint /ping that only returned a 200 and empty string when I noticed this happening
It now points to the sites homepage to see if it changed anything - still the same.
Assuming the database is not the issue as the /ping endpoint doesn't touch the database - just a straight controller return.
Internal Exception handling is catching nothing
Service: Azure Free Web App (Development)
There are no web jobs or timed events on this site
Azure Dashboard Initial
Current tests:
Uploading as new site to a Basic 1 Small
Restarting dev site 12 hours before issues (usually 20 hours before)
Results:
Restarting free web-app 12ish hours before issue - same result at same time - so its not the app slowly overloading or it would me much later
Basic 1 Small: no problems - could it be something with the dev server ?
Azure Dashboard From Today
Observations:
Same behavior with /ping endpoint (just return empty string 200 Ok) and Main home page endpoint (database lookups [w/caching] / razer)
If anyone has any ideas what might be going on - I would very much appreciate it
:-)
Update:
It seems to of stopped (on its own) about 11/1/2016 1:50:49 AM GMT - my internal timestamp says it restarted - and then the errors started again same time as usual. Note: no-one is using the app. The basic 1 Small Server is still going fine.
Sorry I can't add anymore images (not enough rep)
By default, web apps are unloaded if they are idle for some period of time, which could cause the web site slow response during this period of time. Besides, this article is about troubleshooting HTTP "502 Bad Gateway" error or a HTTP "503 Service Unavailable" error in Azure web apps, you could read it. And from the article we could know scaling the web app could mitigate the issue.

Resources