At some point, Azure DevOps pipelines started to return "timeout" for Helm tasks after spending 3+ minutes executing the task:
UPGRADE FAILED: timed out waiting for the condition
It was somewhat misleading, because the error happened at the same time Azure had an incident with some of its services in Europe.
The actual cause of this error was a lack of capacity in the target AKS cluster: apparently, the task timed out while waiting for the requested resources to be provisioned.
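A quick way to confirm the capacity theory is to look for pods stuck in Pending, which is what a starved cluster produces while Helm waits. Here is a minimal sketch using the fabric8 Java client (the same client that appears in a stack trace further down; kubectl get pods --all-namespaces --field-selector=status.phase=Pending gives the same answer from the CLI):

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PendingPodCheck {
    public static void main(String[] args) {
        // Uses the current kubeconfig context, so point it at the target AKS cluster first.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Pods stuck in Pending usually mean the scheduler found no node with
            // enough free CPU/memory - the same shortage that keeps a helm upgrade
            // waiting until the task-level timeout fires.
            for (Pod pod : client.pods().inAnyNamespace().list().getItems()) {
                if ("Pending".equals(pod.getStatus().getPhase())) {
                    System.out.printf("Pending: %s/%s%n",
                            pod.getMetadata().getNamespace(),
                            pod.getMetadata().getName());
                }
            }
        }
    }
}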
Related
I set up a Databricks instance on Azure using Terraform. The deployment seems fine, but I am getting the following error when creating/starting a new cluster:
Message
Cluster terminated. Reason: Cloud provider launch failure
Help
A cloud provider error was encountered while launching worker nodes.
See Knowledge base: Cloud provider initiated terminations for remediations.
Search for error code NetworkingInternalOperationError
Details
Azure error code: NetworkingInternalOperationError
Azure error message: An unexpected error occured while processing the network profile of the VM. Please retry later.
Any idea why this is happening?
Such errors are usually returned when there are temporary problems with the underlying VM infrastructure. They are usually mitigated very quickly, so you just need to try again later, although it makes sense to check the Azure Databricks and Azure status pages - they may show whether an outage is in progress.
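Since the failure is transient, the caller has to tolerate it. Here is a minimal retry-with-backoff sketch in Java; launchCluster() is a hypothetical stand-in for whatever actually starts the cluster (the Databricks REST API, Terraform, etc.):

import java.time.Duration;

public class TransientRetry {
    public static void main(String[] args) throws InterruptedException {
        Duration delay = Duration.ofSeconds(30);
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                launchCluster();   // hypothetical stand-in for the real cluster-start call
                return;            // success, stop retrying
            } catch (RuntimeException transientError) {
                System.out.printf("Attempt %d failed: %s%n", attempt, transientError.getMessage());
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2);   // exponential backoff: 30 s, 60 s, 120 s, ...
            }
        }
        throw new IllegalStateException("still failing after retries; check the Azure status pages");
    }

    // Placeholder so the sketch compiles; replace with the real launch call.
    private static void launchCluster() {
        throw new RuntimeException("NetworkingInternalOperationError");
    }
}

In practice you would only retry on error codes known to be transient (like NetworkingInternalOperationError) and fail fast on everything else.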
I am using Azure YAML pipelines to deploy my .NET application. Generally, a deployment completes in 30 minutes for one server.
But for the past three days, it has been taking almost 90 minutes.
My organization's network is fine, and on the Azure status page pipelines are in advisory mode with the message "expect start time delays up to 30 minutes for macOS hosted pipelines during peak hours".
Is there any recent update from Microsoft related to YAML pipeline performance?
You could check the Azure DevOps Status page; when Azure DevOps has an active event, the latest news is posted there.
And indeed, I found this event: "Advisory: expect start time delays up to 30 minutes for macOS hosted pipelines during peak hours".
The latest news and workaround is: Customers will continue to see daily queueing during peak hours for macOS pools on Azure Pipelines due to a capacity shortage. We’re adding additional capacity over the next 3 months. Until the shortage is resolved, customers should expect queueing during peak hours.
As a workaround, customers can add self-hosted macOS agents to add new capacity for running their Pipelines: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/v2-osx?view=azure-devops
I have a problem with Azure and its Kubernetes environment. From time to time, calls to the k8s API fail, and when that happens the pod experiencing the issue stops responding to calls from other pods (it looks more like a network issue than the application hanging). The health check keeps working, though, so k8s does not restart the pod. The only way to restore it is to delete the pod.
Here is part of the stack trace.
Failed to get list of services from DiscoveryClient org.springframework.cloud.kubernetes.discovery.KubernetesDiscoveryClient#7057dbda due to: {}
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list] for kind: [Service] with name: [null] in namespace: [dev] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:602)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:63)
at org.springframework.cloud.kubernetes.discovery.KubernetesDiscoveryClient.getServices(KubernetesDiscoveryClient.java:253)
at org.springframework.cloud.kubernetes.discovery.KubernetesDiscoveryClient.getServices(KubernetesDiscoveryClient.java:249)
.....
Caused by: java.net.SocketTimeoutException: timeout
at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
What I know now:
the issue happens on Azure
it doesn't depend on the load (the actual calls are background jobs that themselves don't depend on the load, so this is expected)
it doesn't depend on the number of services deployed or on the frequency of calls to the k8s API (so the actual traffic to the k8s API from the cluster doesn't matter)
it is very selective: if one service/replica is affected, others can work without issues
if the affected pod is restarted quickly (we have a job to automate restarts), the problem tends to jump to another service
Azure support says it is a problem with our apps, which are built on Spring Boot and its auto-discovery mechanism, but I am starting to doubt it.
Basically, it looks like the pod is partially lost by the k8s engine.
So the question is: what is wrong, and what else can I check?
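One cheap thing to rule out on the client side is the fabric8 request timeout: the SocketTimeoutException above is raised by OkHttp when the API server does not answer within the configured window (10 seconds by default), so raising it at least distinguishes a slow API server from a connection that is truly dead. A sketch, assuming a recent fabric8 client (older versions use new DefaultKubernetesClient(config) instead; the timeout values are illustrative):

import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class TunedClient {
    public static void main(String[] args) {
        // Raise the connection and per-request timeouts from the 10 s defaults.
        Config config = new ConfigBuilder()
                .withConnectionTimeout(10_000)   // ms
                .withRequestTimeout(30_000)      // ms
                .build();
        try (KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build()) {
            // The same list call that fails in the stack trace above.
            client.services().inNamespace("dev").list().getItems()
                  .forEach(svc -> System.out.println(svc.getMetadata().getName()));
        }
    }
}

If the call still times out with a generous window while the same request from another pod succeeds, that points at the pod's networking rather than the application, which matches the "partially lost pod" behaviour described above.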
I have been using Azure Data Factory V2 for a while to execute SSIS packages from the SSISDB catalog.
Today (16-11-2018) I encountered an "Unexpected Termination" failure message without any warning or error messages.
Things that I have done:
Executing the SSIS package manually from the SSISDB catalog in SQL Server Management Studio (SSMS). What I have noticed is that it takes an exceptionally long time to assign the task to a machine. Once the package is assigned to a machine, it throws back the failure message within one or two minutes.
There are 3 SSIS packages that are executed "sequentially" by the Azure Data Factory pipeline. Often the 1st package executes successfully; however, the 2nd and 3rd packages never succeed.
Another error message that I got is "Failed pull task from SSISDB, please check if SSISDB has exceeded its limit".
I hope someone can help me with this issue. I have been searching the web and could not find anything on this subject.
What tier of Azure SQL Server have you provisioned for the SSISDB to run on? If it's too small, it may take too long to start up and throw a timeout.
Personally, I've had no problems provisioning an S3-tier Azure SQL server.
Hope this helped!
Martin
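It can also help to look at what the catalog itself recorded for those runs by querying SSISDB's catalog.operations view; status 6 there is the "Ended unexpectedly" state behind the "Unexpected Termination" message. A sketch over JDBC (the connection string is a placeholder; the same query can simply be run in SSMS):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SsisdbOperations {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection string - substitute your server and credentials.
        String url = "jdbc:sqlserver://<your-server>.database.windows.net;"
                   + "databaseName=SSISDB;user=<user>;password=<password>";
        String sql = "SELECT TOP 20 operation_id, status, created_time "
                   + "FROM catalog.operations ORDER BY created_time DESC";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // status codes: 4 = Failed, 6 = Ended unexpectedly, 7 = Succeeded
                System.out.printf("%d\t%d\t%s%n",
                        rs.getLong("operation_id"),
                        rs.getInt("status"),
                        rs.getTimestamp("created_time"));
            }
        }
    }
}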
I created a runbook to process multiple partitions of my cube. I launched it, and 3 hours later it stopped. There is no error message or warning, and nothing in the output pane. The only message I have is this one:
Exception
The job was evicted and subsequently reached a Stopped state. The job cannot continue running
I have absolutely no idea why it stopped.
Any ideas?
An Azure Automation runbook job can only run for 3 hours at a stretch, because of the service's fair-share policy:
In order to share resources among all runbooks in the cloud, Azure Automation will temporarily unload any job after it has been running for three hours. Jobs for PowerShell-based runbooks are stopped rather than suspended, and the job status shows Stopped. Because this type of runbook doesn't support checkpoints, it is always restarted from the beginning.
https://learn.microsoft.com/en-us/azure/automation/automation-runbook-execution#fair-share
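Since this runbook type restarts from scratch, a common mitigation is to make the job resumable yourself: persist the index of the last completed partition somewhere durable (an Automation variable, a blob, a table) and skip past it on restart. The idea, sketched in Java for illustration (a real runbook would be PowerShell, and loadProgress/saveProgress are hypothetical stand-ins for that durable storage):

import java.util.List;

public class ResumablePartitionJob {
    public static void main(String[] args) {
        List<String> partitions = List.of("2018-Q1", "2018-Q2", "2018-Q3", "2018-Q4");
        int done = loadProgress();                    // 0 on the very first run
        for (int i = done; i < partitions.size(); i++) {
            processPartition(partitions.get(i));      // the long-running work
            saveProgress(i + 1);                      // checkpoint after each completed unit
        }
    }

    static int loadProgress() { return 0; }           // placeholder: read from durable storage
    static void saveProgress(int completed) { }       // placeholder: write to durable storage

    static void processPartition(String name) {
        System.out.println("Processing " + name);
    }
}

Alternatively, split the work so each run finishes well inside the three-hour window, or use a PowerShell Workflow runbook, which does support checkpoints via Checkpoint-Workflow.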