Azure DevOps self-hosted agent error: connectivity issues

We are using Azure DevOps self-hosted agents to build and release our application. We often see the error below, and the agent recovers automatically. Does anyone know what this error means, how to tackle it, and where exactly to check the logs for it?
We stopped hearing from agent <agent name>. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink?Linkid=846610

This seems to be a known issue with both self-hosted and Microsoft-hosted agents that many people have been reporting.
Quoting the reply from @zachariahcox of the Azure Pipelines product group:
To provide some context, the Azure Pipelines agent is composed of two processes: agent.listener and agent.worker (one of the latter per step in the job). The listener is responsible for reporting that workers are still making progress. If the agent.listener is unable to communicate with the server for 10 minutes (we attempt to communicate every minute), we assume something has gone wrong and abandon the job.
So, if you're running a private machine, anything that can interfere with the listener's ability to communicate with our server is going to be a problem.
Among the issues I've seen are anti-virus programs identifying it as a threat, local proxies acting up in various ways, the physical machine running out of memory or disk space (quite common), the machine rebooting unexpectedly, someone Ctrl+C'ing the whole listener process, the work payload being run at a much higher priority than the listener (thus "starving" the listener out), unit tests shutting down network adapters (quite common), having too many agents at normal priority on the same machine so they starve each other out, etc.
If you think you're seeing an issue that cannot be explained by any of the above (and nothing jumps out at you from the _diag logs folder), please file an issue at
https://azure.microsoft.com/en-us/support/devops/
If everything seems to be perfectly alright with your agent and none of the steps in the Pipeline troubleshooting guide help, please report it on Developer Community, where the Azure DevOps team and the DevOps community are actively answering questions.
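The listener/worker arrangement described above boils down to a heartbeat timeout: the listener checks in every minute, and the server abandons the job after ten minutes of silence. A minimal sketch of that logic (an illustrative model only, not the actual agent code; the intervals come from the quote above):

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=1)   # listener attempts contact this often
ABANDON_AFTER = timedelta(minutes=10)       # server gives up after this much silence

def should_abandon(last_heard: datetime, now: datetime) -> bool:
    """Server-side view: abandon the job if the listener has been silent too long."""
    return now - last_heard >= ABANDON_AFTER

# A listener last heard from 12 minutes ago is considered lost;
# one heard from 5 minutes ago is still fine.
now = datetime(2023, 1, 1, 12, 0)
print(should_abandon(now - timedelta(minutes=12), now))  # True
print(should_abandon(now - timedelta(minutes=5), now))   # False
```

Anything on the list above (CPU starvation, a dropped network adapter, a reboot) simply stops the heartbeats from arriving, which is why such different root causes all surface as the same "we stopped hearing from agent" error.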

Related

Are Azure Pipelines designed to support long-running jobs (running for days, potentially weeks)?

I have a long-running Java/Gradle process and an Azure Pipelines job to run it.
It's perfectly fine and expected for the process to run for several days, potentially over a week. The Azure Pipelines job is run on a self-hosted agent (to rule out any timeout issues) and the timeout is set to 0, which in theory means that the job can run forever.
Sometimes the Azure Pipelines job fails after a day or two with an error message that says "We stopped hearing from agent". Even when this happens, the job may still be running, as is evident when SSH-ing into the machine that hosts the agent.
When I discuss investigating these failures with DevOps, I often hear that Azure Pipelines is a CI tool that is not designed for long-running jobs. Is there evidence to support this claim? Does Microsoft commit to only support running jobs within a certain duration limit?
Based on the troubleshooting guide and timeout documentation page referenced above, there's a duration limit applicable to Microsoft-hosted agents, but I fail to see anything similar for self-hosted agents.
Agree with @Daniel Mann.
It's not common to run long-running jobs, but per the docs, it should be supported.
"We stopped hearing from agent" can be caused by a network problem on the agent, or an agent issue due to high CPU, storage, or RAM pressure, etc. You can check the agent diagnostic logs to troubleshoot.
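The diagnostic logs live in the `_diag` folder under the agent's install directory (`Agent_*.log` for the listener, `Worker_*.log` for job workers). A sketch of scanning them for connectivity trouble around the failure time; the sample lines below are invented for illustration and the real log format differs, but the scanning approach carries over:

```python
import re

# Invented excerpts in the rough style of an agent's _diag/Agent_*.log file.
sample_log = """\
[2023-01-01 03:10:01Z INFO MessageListener] Receiving messages
[2023-01-01 03:12:44Z WARN VisualStudioServices] Connection reset by peer
[2023-01-01 03:12:45Z ERR  MessageListener] Failed to receive message: SocketException
[2023-01-01 03:13:02Z INFO MessageListener] Reconnected to server
"""

# Flag warning/error lines that mention connectivity-related keywords.
pattern = re.compile(r"(WARN|ERR).*(connection|socket|timeout|reset)", re.IGNORECASE)
suspects = [line for line in sample_log.splitlines() if pattern.search(line)]
for line in suspects:
    print(line)
```

In practice you would point the same scan at the real `Agent_*.log` file whose timestamps bracket the failure, rather than an inline string.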

Erlang connection reset

I have an API application running in a Docker container, and since moving to AWS, the API stops daily with the error: Erlang closed the connection. I've monitored the server during that time, and no IOPS spikes seem to be causing the issue. Beyond that, when the API fails, it won't restart on its own on one of our clusters. I'm not sure where to find the logs to get more context and could use any input that may be helpful. For more context: this API worked fairly well before in our data-center/physical-server space, but now in AWS it fails daily. Any thoughts or suggestions as to why it may be failing to restart?
I've looked at the syslogs and the application server logs and don't see any kind of failure. Maybe I'm not looking in the proper place. At this point, someone from one of our teams has to manually restart the API with an init.d command. I don't want to create a cron job to "fix" this, because that's a band-aid and not a true fix.
There really isn't enough information on the structure of your app or its connections, so no one can give a "true fix".
The problem may be as small as configuring your nodes differently, changing some of the server's local configuration, or you may need some "keep alive" handler towards AWS.
Maybe try adding a periodic manual dump to see if it's an accumulating problem, but I believe if Erlang closed the connection there's a problem between your API and AWS, and your API can't work out where it is located.

Syncing clocks on multiple Azure VMs

I have a requirement to write a load test measuring message-transmission latencies. In order to simulate a large number of simultaneous users without running into thread-contention problems on one box, I'm spinning up multiple servers in Azure.
When I got my first results back, I was a little shocked to see that the results indicated the message was received before it was sent. I immediately realized that, while I had an implicit assumption that all the VMs would have their clocks synced to within milliseconds, that was clearly not the case.
I've spent several hours googling ways to resolve this, and I'm not getting anywhere. One thought was to have each VM query the time on a central server using NetRemoteTOD(), and then establish a per-machine correction factor to be added to the time measured from the local machine's clock. However, when I tried to run that method, I got error 2184, "The service has not been started". I have verified that both the RPC service and the Windows Time service are running on both the client and target machines, and I have not been successful in finding any information indicating what other service needs to be running (or even whether the error really means what it seems to mean). (I also get the same error when running between my development desktop and a server on our corporate network. However, I can run it successfully against a PDC on the corporate network - but I can't find a PDC on Azure, since neither machine is part of a domain.)
So, does anyone have any information on what service needs to be started to get NetRemoteTOD (or the Windows NET TIME command, which relies on NetRemoteTOD under the covers) working? Alternatively, does anyone have a suggestion for some other technique to get a consistent time reference across multiple VMs in Azure? (Note: I don't necessarily need their clocks synced; I just need a way to establish a consistent correction factor to reference the times to a common source. Note also that I need sub-second accuracy - about 100 ms will do.) Basically, I just need a Windows function or shell command that will get me the time to sub-second accuracy on a given remote server.
Thanks in advance.
PS. Azure servers are running Server 2008 R2 SP1
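The correction-factor technique the question describes can be sketched independently of NetRemoteTOD: sample the reference clock and the local clock together, estimate the per-machine offset, and add it to later local readings. This is a simplified sketch that ignores network round-trip delay; for ~100 ms accuracy you would record the local time before and after the remote query and use the midpoint, NTP-style. The function names are invented for illustration:

```python
# Per-machine clock correction factor. `reference_time` stands in for whatever
# remote-time query you use (NetRemoteTOD, an NTP query, a central web service).

def estimate_offset(local_time: float, reference_time: float) -> float:
    """Offset (seconds) to add to local readings to line up with the reference clock."""
    return reference_time - local_time

def corrected(local_time: float, offset: float) -> float:
    """Translate a later local timestamp into the reference clock's frame."""
    return local_time + offset

# Example: at the sampling instant, the local clock reads 1000.0 while the
# reference reads 1002.5, so this machine runs 2.5 s behind the reference.
offset = estimate_offset(local_time=1000.0, reference_time=1002.5)
print(corrected(1010.0, offset))  # 1012.5
```

With one such offset per VM, send and receive timestamps from different machines become comparable, which is all the latency measurement needs; the clocks themselves never have to be adjusted.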

Does Azure force-kill processes by itself? My Node.js/Java/JMeter processes are force-killed

I am using Windows Azure for a performance test across about 8 nodes, each running a different application. Since it's a performance test, we do generate quite a bit of traffic.
The test was running just fine for a few hours. Then we suddenly realised a few of the applications - Node.js, JMeter, and even Java processes - had been force-killed, each at a different time.
We find nothing in the logs that indicates an out-of-memory condition or any other error or application issue. And this happens pretty often, once every few hours. For example, we have seen JMeter shut down once every 3-4 hours, and once it happened after 10 hours of continuous running.
So we suspect Azure is using root permissions to force-kill the above processes.
Did any of you notice this with your applications on Azure, and do you know why?
Short answer: no, Azure does not kill your processes. There is no such thing as 'root permissions' to kill specific processes.
Are you running an IaaS VM or a PaaS Web/Worker Role? For PaaS, check out http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx for where to start getting diagnostic data. For IaaS, troubleshoot it like you would on-prem (DebugDiag, WinDBG, procmon, Application/System event logs, etc) since there is really nothing specific about Azure that would cause this behavior.

Windows Azure: Unexpected & unclean virtual-machine shutdown

Using a large instance of a virtual machine on Windows Azure. The instance runs Microsoft SQL Server 2012 with light usage, on Windows Server 2012, fully up to date. No user is logged in at the time of the failures.
However, several times a day (between none and three, apparently at random), the VM halts and shuts down. It does not come back online until someone logs back into the Management Portal and starts the VM again. There is no memory dump created, so I am guessing the host halts the running VM, rather than some configuration within the guest OS causing the halt. The subscription has billable funds. Other VMs in the subscription are also affected.
Only event logs generated:
Kernel-Power logged:
The system has rebooted without cleanly shutting down first. This
error could be caused if the system stopped responding, crashed, or
lost power unexpectedly.
Kernel-Boot logged:
The last shutdown's success status was false. The last boot's success
status was true.
How can this be resolved? There is no way to initiate a support request within Azure.
The first thing I would do is install some monitoring software like New Relic or Foglight and see whether you are running out of memory or a process is pushing the CPU into a spin.
This will give you some visibility of the activity on the box over time, and give you some evidence should you need it to open a support request.
Azure now has paid support only:
http://www.windowsazure.com/en-us/support/plans/
We use the Developer plan for exactly this type of situation, where you are a bit lost figuring things out; the $30 cost, compared to the monthly cost of running a SQL Server 2012 VM, makes it worth having. Microsoft's support is generally very good; they have more diagnostic information and will be able to tell you whether this is an Azure failure or something else.
Getting diagnostics going would be the first port of call, though - then you can see what is going on, gather some evidence, and track down the problem.