Container stuck with no status and no events - Azure

I am using Azure Container Instances for my services, but after a few days of running, the containers get stuck and Azure cannot find them (it returns 404).
After the problem happens the service still has a public IP, but nothing else.
All metrics show no usage, and the events tab shows an error saying "no events in the last hours".
I cannot even connect to the container, because it says the container does not exist.
The container status contains nothing (not "stopped", just dots as if there is no value), and I cannot start the containers but I can stop them (which is weird, because Azure complains that they do not exist).
After I stop the containers, all the data (status and events) shows the values from the last manual stop.
I have tried running the Docker image on a regular server without any problems. I have also ruled out a resource problem by monitoring resource usage.

Related

Heroku - restart on failed health check

Heroku does not support health checks on its own. It will restart services that crash, but there is nothing like a health check.
It sometimes happens that a service becomes unresponsive while the process is still running. In most modern cloud solutions you can provide a health endpoint that is periodically called by the hosting service; if that endpoint returns an error, or does not respond at all, the service is shut down and a new one is started.
That seems like an industry standard these days, but I am unable to find any solution for Heroku. I could even use an external service with the Heroku CLI, but just calling some endpoint is not sufficient: if there are multiple instances, they all share the same URL and the load balancer routes each call to one of them at random, so it is possible to never hit the failed instance at all. Even when I do hit it, health checks usually apply a rule like "after 3 failed checks in a row, restart that instance", which is very unlikely to trigger if there are 10 instances and only one of them becomes unhealthy.
Do you have any solution to this?
You are right that this is an industry standard, and it is a shame that it is not provided out of the box.
I can think of two solutions (both involve running some extra code that does all of this):
a) use the Heroku API, which allows you to get the IP of individual dynos, and then call each dyno however you want
b) from each dyno instance, periodically send a request to a web server such as https://iamaalive.com/?dyno=${process.env.HEROKU_DYNO_ID} (see the sketch below)
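A minimal sketch of option (b) in Python, assuming a hypothetical heartbeat collector (https://iamaalive.com is only a placeholder) and Heroku's standard DYNO environment variable; it would run in a background thread inside each instance:

import os
import threading
import time

import requests

HEARTBEAT_URL = "https://iamaalive.com/"        # placeholder collector endpoint
DYNO_NAME = os.environ.get("DYNO", "unknown")   # e.g. "web.1", set by Heroku

def send_heartbeats(interval_seconds=30):
    """Ping the collector so it can flag any dyno that stops reporting."""
    while True:
        try:
            requests.get(HEARTBEAT_URL, params={"dyno": DYNO_NAME}, timeout=5)
        except requests.RequestException:
            pass  # transient errors are fine; prolonged silence is the signal
        time.sleep(interval_seconds)

# Start the heartbeat loop alongside the web process.
threading.Thread(target=send_heartbeats, daemon=True).start()

The collector can then restart any dyno that has gone silent for several intervals, for example with heroku ps:restart web.1 through the CLI or the Platform API.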

Docker Container in Azure Logic App does not exit properly

I have a curious issue getting a docker container set up to run and exit properly in an Azure logic app.
I have a python script that prints hello world, then sleeps for 30 minutes. The reason for sleeping is to make the script run longer so that I can test if the container in the logic app exits at the right moment, when the script is done running and not when the loop times out.
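The script is essentially the following (a minimal version that produces the output shown below):

import time
from datetime import datetime

# Print a greeting, then sleep long enough to observe when the Logic App
# treats the container as finished.
print(f"Running 'Hello World' at {datetime.now():%Y-%m-%d %H:%M:%S}")
print("Hello, World!")
print("Sleeping for 30 minutes...")
time.sleep(30 * 60)
print("Awake after 30 minutes")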
First, I confirmed that the container runs and exits properly in PowerShell:
PS C:\Users\cgiltner> docker run helloworld
Running 'Hello World' at 2019-11-26 17:53:48
Hello, World!
Sleeping for 30 minutes...
Awake after 30 minutes
PS C:\Users\cgiltner>
I have the container set up in a Logic App as follows: there is an “Until” loop configured to run until “State” = “Succeeded”.
But when I run it, the “Until” loop continues for 1 hour, which is the default timeout period for an Until loop (PT1H).
Looking at the properties of the container, I can see that the state of the container never changes from “Running”.
Just to clarify, the container IS running and executes the script successfully. The problem is that the loop does not exit when the script is actually done; instead it waits until the timeout period expires. There is no error message or failure indicating a timeout, it simply moves on to the next step. This has big implications in a complex Logic App where multiple steps need to happen after containers run; it causes things to take hours.
For your issue, the first thing to understand is that the first action of the Logic App creates the Azure Container Instance, but when that action completes, the creation of the container instance is not yet finished. The action only returns a pending state, and that state does not update on its own. Your second action, the Until loop, expects the Succeeded state, so it keeps waiting until it times out.
The solution is to add a pure Delay action after the creation of the Azure Container Instance, and then add an action that gets the current properties and logs of the containers in the container group, so each iteration of the loop evaluates a fresh state (a rough code equivalent is sketched below).
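Outside the Logic App designer, the same "delay, then re-read the state" pattern looks roughly like this in Python with the azure-mgmt-containerinstance SDK (the resource names are placeholders and the sleep intervals are assumptions):

import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-resource-group"    # placeholder
CONTAINER_GROUP = "helloworld-group"    # placeholder

client = ContainerInstanceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

time.sleep(60)  # the "Delay" action: give the container group time to start

while True:
    group = client.container_groups.get(RESOURCE_GROUP, CONTAINER_GROUP)
    state = group.instance_view.state if group.instance_view else None
    print(f"Container group state: {state}")
    if state == "Succeeded":  # the script finished and the container exited
        break
    time.sleep(30)            # re-check instead of waiting out the PT1H timeout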

App Service unavailable for unknown reason

We're running App Service for Linux with a Docker container.
When things work, they work really well. But occasionally our site becomes unavailable for no clear reason.
Our health status reports look like this:
Now, after some time, the app becomes completely unavailable. The health check reports Available, but in our Docker log we find records like this:
2017-11-18 08:01:50.060 ERROR - Container for --- site ---is unhealthy. Stopping site.
2017-11-18 08:32:49.295 INFO - Issuing docker login to sever: http://---
2017-11-18 08:32:49.837 INFO - docker login to http://--- succeeded
2017-11-18 08:32:49.858 INFO - Issuing docker pull ---
2017-11-18 08:39:49.096 INFO - docker pull returned STDOUT>> 40: Pulling from ---
The only thing that helps is restarting the app. Then it comes back to normal and all works as expected.
To emphasise: the site doesn't hang on every 'Unavailable' report from the health check; it hangs randomly. CPU/memory are at normal levels, nothing unusual there and no crazy spikes.
The application itself has a general exception filter, and no uncaught exceptions escape the app.
Any ideas why it might happen?
Depending on the size of your Docker image, the application goes offline while it's pulling and initializing the new image. I noticed that our deploy took nearly 20 minutes before coming back up.

Azure Service Fabric Activation Error 7148

I have a service fabric cluster which hosts numerous applications. One of the applications has a service type where the service is created, runs for a bit, and then is deleted. Everything works great, but the cluster virtually always has its state set to error because there will be a few of these in the "Unhealthy evaluations" section.
Error event: SourceId='System.Hosting', Property='CodePackageActivation:Code:EntryPoint'.
There was an error during CodePackage activation.The service host terminated with exit code:7148
I've wrapped both the program's main and RunAsync in exception handlers, but never see anything in analytics. Is there any way to look up what exit code 7148 means? Thanks.
7148 is a general error code indicating that something failed inside SF while setting up or activating your service's host process. That is why you're not seeing any errors or exceptions: your code never gets a chance to run.
Examples of things I've seen that led to 7148:
The exe was not actually a Windows exe due to corruption
The service's manifest had a reference to a cert or some other prerequisite, such as an endpoint, that was incorrectly configured (like a port that was already in use, or the wrong thumbprint for a cert)
Something blew up inside Windows that caused the process creation to fail, like a failure to correctly configure host networking for a container
Most of the time when I see this I have to look at the Windows error logs to see what's really happening. The SF folks are also trying to capture more of the common causes of failure and report them as better health errors, rather than relying on 7148.
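As a rough aid for that last step, the most recent Application event log entries can be pulled with the built-in wevtutil tool (wrapped here in Python; the log name and entry count are assumptions about what is relevant):

import subprocess

# Query the 20 most recent Application event log entries, newest first, as text.
result = subprocess.run(
    ["wevtutil", "qe", "Application", "/c:20", "/rd:true", "/f:text"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)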

Azure GetMachineGoalState dying from HTTP 410 error

My Azure web role is constantly recycling.
The WaAppAgent.log file on my Azure web role contains a whole stream of these errors:
[00000008] [05/15/2012 00:10:20.90] GetMachineGoalState() failed with exception: Microsoft.ServiceModel.Web.WebProtocolException: Server Error: Gone (Gone) ---> System.Net.WebException: The remote server returned an error: (410) Gone.
at System.Net.HttpWebRequest.GetResponse()
at System.ServiceModel.Channels.HttpChannelFactory.HttpRequestChannel.HttpChannelRequest.WaitForReply(TimeSpan timeout)
--- End of inner exception stack trace ---
at Microsoft.ServiceModel.Web.WebHttpChannelProxy`1.Invoke(IMessage msg)
at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
at Microsoft.WindowsAzure.RoleContainer.Protocol.IControlSystem.GetMachineGoalState()
at Microsoft.WindowsAzure.GuestAgent.ContainerStateMachine.ControlSystem.GetGoalState().
[00000008] [05/15/2012 00:10:20.90] Caught exception in pre-initialization heartbeat thread, will continue heartbeats: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.WindowsAzure.GuestAgent.ContainerStateMachine.ContainerStateManager.InitialHeartbeatThread(Object waitEvent)
[00000005] [05/15/2012 00:10:23.24] Agent runtime initialized.
The Azure service status dashboard says everything is green on their side, but it always says that unless the entire platform is down in a smoldering heap so I don't really trust it at all.
Is this my problem, or theirs?
I take it from your statement that you are using a full-IIS-based Windows Azure Web Role. I have seen such an error with a VM Role, but with a Web Role it seems very strange. I wouldn't put much weight on your logs at this point either, because they may lead you in the wrong direction.
If your Web Role status shows the green (Ready) state but your site is not available, then the issue could be in the application itself, because the green/Ready state means the role host process (in this case WaIISHost.exe) is healthy. It would be a rare occurrence for the role status to be green while the role host process is unhealthy. Based on what you have provided, the other possibility is an issue within the VM startup sequence, meaning your role never even started; however, in that case the portal should not show the Ready state.
Please log into your Azure VM using RDP and:
Check first whether the WaIISHost.exe process is running; keep an eye on it for a couple of minutes to see if it is crashing and restarting.
Check the application event log, because it should show some pattern of exceptions pointing to the root cause.
Check the recent Azurebootstrapper and iisconfigurator logs located in the C:\logs folder for any specific issue during VM boot and IIS startup, respectively.
Finally, back up the logs (C:\logs and C:\Resources) and try rebooting your instance. If you still have the problem, contact the Windows Azure support team:
https://www.windowsazure.com/en-us/support/contact/
