Azure Pipelines fails in Deployment Slots

I have created a release pipeline with a few stages (DEV, QA, Production), where the Production App Service has a Deployment Slot with Auto Swap enabled. However, when I perform the release, it fails in the slot-swapping task with the error message below. I have gone through many articles on Google and Stack Overflow, but none of them seem to help. Any pointers on what could be wrong would be very helpful.
2021-08-18T16:30:41.0295503Z ##[error]Error: Failed to swap App Service 'jdmessaging' slots - 'preprod' and 'production'. Error: Conflict - Cannot modify this site because another operation is in progress. Details: Id: 32473596-226d-46b4-9c98-31285c27418e, OperationName: SwapSiteSlots, CreatedTime: 8/18/2021 4:28:43 PM, WebSystemName: WebSites, SubscriptionName: 74d83097-e9c9-4ca7-9915-7498a429def4, WebspaceName: DEMO-CentralUSwebspace, SiteName: jdmessaging, SlotName: preprod, ServerFarmName: , GeoOperationId: (null) (CODE: 409)
Note: The first release with Deployment Slots succeeded. We encountered this issue on the second release.

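This looks like a case of two overlapping operations on the same site:
One operation was triggered and had not yet completed when another operation (a site modification) was triggered on the same site. The error details show that the operation in progress was itself a SwapSiteSlots on the 'preprod' slot, created at 4:28:43 PM, about two minutes before the swap task failed. That would be consistent with Auto Swap already swapping the slot when the pipeline's own swap task ran.
The second operation waited for the first one to complete and ultimately failed.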
Suggestion:
Wait for some time and retry the operation. It should succeed.
If it still fails, please create a technical support ticket by following the link, where the technical support team can help you troubleshoot the issue from the platform end.
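The "wait and retry" suggestion can be sketched as a small retry loop with a growing delay. This is only an illustration in Python, not an Azure Pipelines feature: `ConflictError`, `retry_on_conflict`, and the delay values are hypothetical names and numbers chosen for the example, and `operation` stands for whatever call performs the swap in your own tooling.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ConflictError(Exception):
    """Hypothetical stand-in for a 409 'another operation is in progress' failure."""


def retry_on_conflict(
    operation: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 30.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, waiting and retrying while it raises ConflictError."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConflictError:
            if attempt == attempts:
                # Out of attempts: re-raise so the caller sees the conflict.
                raise
            sleep(delay)      # give the in-progress operation time to finish
            delay *= backoff  # wait longer before the next attempt
    raise ValueError("attempts must be at least 1")
```

If every attempt still ends in a conflict, the exception propagates to the caller, which matches the second suggestion above: at that point the problem is probably not just timing.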

Related

Azure Function slot warmed but still experiences cold start

For our Azure function we use the auto-slot-swapping feature with the following appsettings to ensure our slot is warmed before going live:
WEBSITE_OVERRIDE_PRESERVE_DEFAULT_STICKY_SLOT_SETTINGS = 1
WEBSITE_SWAP_WARMUP_PING_PATH = "/api/healthcheck"
WEBSITE_SWAP_WARMUP_PING_STATUSES = "200"
This results in our ADO pipeline calling the healthcheck endpoint (confirmed), and only swapping the slot to live if it's successful.
The problem is that after all this takes place, the first request takes many seconds to return a response. Any request thereafter is virtually instant. This behaviour is consistent for every deploy.
We would not expect this, because we know the Staging slot is warmed when the healthcheck endpoint is hit, before the slot is then swapped into Production. So why do we experience this cold start delay? We can even wait a minute or two after the slot swapping has completed, and we always experience it.
Is there something odd happening, like once the slot is moved into Production, it needs to be hit again before it's warmed?
This may help you.
After slot swaps, the app may experience unexpected restarts. This is because after a swap, the hostname binding configuration goes out of sync, which by itself doesn’t cause restarts. However, certain underlying storage events (such as storage volume failovers) may detect these discrepancies and force all worker processes to restart. To minimize these types of restarts, set the WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG=1 app setting on all slots
If you set the WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG app setting to 1, you should be able to get rid of the cold starts that are caused by the host restarting. However, please be aware that during the slot swap the function may process requests very slowly.
You may also check this GitHub issue, where you will find a discussion about zero-downtime deployment.

Azure alert on Azure Functions "Failed" metric is triggering with no apparent failures

I want an Azure Alert to trigger when a certain function app fails. I set it up with a threshold of greater than or equal to 1 on the [function name] Failed metric, thinking that would yield the expected result. However, when the function runs daily I get notifications that the alert fired, but I cannot find anything in Application Insights to indicate a failure, and the function appears to run and complete successfully.
Here is the triggered alert summary:
Here is the invocation monitoring from the portal showing that same function over the past few days with no failures:
And here is an application insights search over that time period showing no exceptions and all successful dependency actions:
The question is: what could cause an Azure Function Failed metric to register non-zero values without any telemetry in Application Insights?
Update - here is the alert configuration
And the specific condition settings-
Failures blade for wider time range:
There are some dependency failures on a blob 404 but I think that is from a different function that explicitly checks for the existence of blobs at paths to know which files to download from an external source. Also the timestamps don't fall in the sample period.
No exceptions:
Per a comment on the question by @ivan-yang, I have switched the alerting to use a custom log search instead of the built-in Azure Function metric. At this point that metric seems pretty opaque as to what triggers it: it fired every day when the Azure Function ran, with no apparent underlying failure. I plan to avoid this metric for now.
My log based alert is using the following query for now to get what I was looking for (an exception happened or a function failed):
requests
| where success == false
| union (exceptions)
| order by timestamp desc
Thanks to @ivan-yang and @krishnendughosh-msft for the help.

What might cause the 'InternalServerError executing request' when running a manually triggered pipeline?

The setup of the pipeline is a simple import from a .csv file stored in Azure Blob Storage to an Azure SQL database table.
When I run the pipeline in Debug by using the 'Debug' button in the portal, the job finishes in 8 seconds.
When I run the pipeline with the Add trigger\Trigger now button it runs for 20+ minutes and fails with the error 'InternalServerError executing request'.
I recreated the pipeline and its components from scratch and tried both a Data Flow (Preview) and a Copy Data activity; both give the same result.
The expected output is a successful run of the pipeline. The actual output is the 'InternalServerError executing request' error.
The problem was with source control, which we had recently enabled. 'Add trigger\Trigger now' uses the published version of the pipeline, whereas Debug uses the currently saved version. The 20-minute timeout and the 'InternalServerError executing request' error are a poor way of saying: 'You did not publish your pipeline yet' :)
Just to add another possible cause in case someone else stumbles upon this:
I had the same error multiple times when I had many concurrent pipeline runs, in my case triggered by hundreds of new files in a OneDrive folder ("manually" triggering the pipeline via Azure Logic App). Some of the runs succeeded, some of them failed. When I reran the failed runs or loaded fewer files at once, it worked.
So Data Factory might not yet handle a large number of parallel pipeline runs very well.
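Since loading fewer files at once worked, one option is to cap how many runs are started in parallel on the caller's side. The sketch below is a generic Python illustration of that idea, not a Data Factory or Logic Apps API: `run_with_limit` and `max_parallel` are hypothetical names, and `trigger` stands for whatever function starts one pipeline run in your setup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

Item = TypeVar("Item")
Result = TypeVar("Result")


def run_with_limit(
    items: Iterable[Item],
    trigger: Callable[[Item], Result],
    max_parallel: int = 4,
) -> List[Result]:
    """Call `trigger` once per item, with at most `max_parallel` calls in flight."""
    # The pool size is the concurrency cap; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(trigger, items))
```

The right value for `max_parallel` is something to find by experiment; the answer above only establishes that hundreds of simultaneous runs were too many in that case.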
Just to add another possible cause in case someone else stumbles upon this:
Check if the data factory is down from the Resource Health tab.
I was getting Internal Server Error for all the sandbox runs.

Azure Functions: Application freezes - without any error message

I have an Azure Functions application which once in a while "freezes" and stops processing messages and timed events.
When this happens I do not see anything in the logs (Application Insights), neither exceptions nor any kind of unfamiliar traces.
The application has the following functions:
One processing messages from a Service Bus topic subscription (belonging to another application)
One processing from an internal storage queue
One timer based function triggered every half hour
Four HTTP endpoints
Our production app runs fine. This is due to an internal dashboard (on a big screen in the office), which polls one of the HTTP endpoints every 5 minutes, thereby keeping the app alive.
Our test, stage and preproduction apps stop after a while and no longer process messages or timer events.
This question is more or less the same as my previous question, but without the error message that was in focus then. There are far fewer error messages now that our deployment has been fixed.
A more detailed analysis can be found in the GitHub issue.
On a consumption plan, all triggers are registered in the host, so that these can be handled, leading to my functions being called at the right time. This part of the host also handles scalability.
I had two bugs:
Wrong deployment. Do a zip-based deployment as described in the Docs.
Malformed host.json. Comments are not valid JSON. Azure Functions tolerates them in most circumstances, but not all.
The sites now work as expected, both in terms of availability and scalability.
Thanks to the people in the Azure Functions team (Ling Toh, Fabio Cavalcante, David Ebbo) for helping me out with this.
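The host.json problem described above can be caught before deployment by checking that the file parses as strict JSON. The following is a minimal Python sketch, not part of the Azure Functions tooling: `find_json_error` is a hypothetical helper name, and the sample contents are illustrative only.

```python
import json
from typing import Optional


def find_json_error(text: str) -> Optional[str]:
    """Return None if `text` is strict JSON, otherwise describe the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


# Illustrative file contents (assumed for the example, not taken from the question).
strict = '{"version": "2.0"}'
with_comment = '{\n  // comments are not part of the JSON grammar\n  "version": "2.0"\n}'

print(find_json_error(strict))        # None: the file is strict JSON
print(find_json_error(with_comment))  # reports an error on line 2, where the comment starts
```

Python's `json` module follows the strict grammar and rejects comments, so it serves as a conservative check here: a file that passes does not rely on a lenient parser.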

WaWorkerHost.exe crashes role: CallbackException

When I run my WorkerRole C# application on Azure, after a while WaWorkerHost.exe crashes due to the following exception:
Application: WaWorkerHost.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.Runtime.CallbackException
Stack:
at System.Runtime.Fx+IOCompletionThunk.UnhandledExceptionFrame(UInt32, UInt32, System.Threading.NativeOverlapped*)
at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
I have an application that generates load on a web server. I don't care about the actual response, but I want to control the number of requests made per second.
Therefore I have a Timer that fires every second and generates a number of requests. I have tried the following options:
Parallel.For with WebRequests
For loop with async WebRequests
For loop with ThreadPool.QueueUserWorkItem (do webrequest)
When the number of requests increases (8+ req/sec), the exception occurs. It is the same exception for all three options. When I run the role in the local Development Fabric, all three options work just fine. If someone could give me some pointers on what might be going wrong, I would appreciate it. If you have other ideas for generating this type of load from Azure and C#, please share your thoughts.
The author answered the question in a comment on the original post, but for better visibility I'm reposting it here:
Turned out to be an IntelliTrace issue, see
http://social.msdn.microsoft.com/Forums/en-ZA/windowsazuretroubleshooting/thread/543da280-2e5c-4e1a-b416-9999c7a9b841:
...
After redeploying my solution with IntelliTrace disabled, the issues were resolved, and my WorkerRole stayed healthy.
