One of our Azure Logic Apps is experiencing a bizarre gap between the time of invocation and the time it actually begins executing. This is resulting in extremely long, erroneous run durations. Here is an example of a problematic run:
Start date: Friday, December 13, 2019, 1:45:42 PM
End date: Friday, December 13, 2019, 2:24:30 PM
Run Duration: 38.79 minutes
Check_if_journal_file 0 Milliseconds
Condition 0 Milliseconds
Delay 1.52 Seconds
Get_Blob_Metadata 15 Milliseconds
HTTP 281 Milliseconds
ImportBreakpointJournalFile 406 Milliseconds
Initialize_JobHttpStatusCode 110 Milliseconds
Journal_file_failed_to_process 0 Milliseconds
Journal_file_successfully_processed 31 Milliseconds
Not_a_journal_file 0 Milliseconds
Set_variable 157 Milliseconds
Until 4.69 Seconds
Actual Action Execution Duration: 7.21 seconds
Action Execution Graph
Since the execution itself only took ~7 seconds, that means the logic app was just sitting idle, doing nothing, for almost 38 minutes! Runs previous to this one show no timing problems.
Has anyone else seen behavior like this?
What would cause a logic app to idle for 38 minutes before beginning execution?
UPDATE:
Turned on detailed diagnostics for the logic app and got the following results, which show that it really is idling or suspended. You can see that job 08586252374669322278104928528CU37 begins at 12:23, then almost immediately suspends (?) before taking any action, and then resumes at 12:58 for no apparent reason. After resuming, you can see it begins executing normally, as Initialize_JobHttpStatusCode is the first action in the app.
TimeGenerated [UTC] startTime_t [UTC] waitEndTime_t [UTC] resource_runId_s resource_originRunId_s resource_actionName_s endTime_t [UTC]
12/15/2019, 12:58:57.627 AM 12/15/2019, 12:58:57.471 AM Invalid Date 08586252374669322278104928528CU37 Initialize_JobHttpStatusCode 12/15/2019, 12:58:57.549 AM
12/15/2019, 12:58:57.540 AM 12/15/2019, 12:58:57.471 AM Invalid Date 08586252374669322278104928528CU37 Initialize_JobHttpStatusCode Invalid Date
12/15/2019, 12:58:57.405 AM 12/15/2019, 12:23:38.550 AM 12/15/2019, 12:58:57.330 AM 08586252374669322278104928528CU37 08586252374669322278104928528CU37 Invalid Date
12/15/2019, 12:58:57.282 AM 12/15/2019, 12:58:56.901 AM Invalid Date 08586252353485738359883907091CU99 12/15/2019, 12:58:56.980 AM
12/15/2019, 12:58:57.258 AM 12/15/2019, 12:58:56.901 AM Invalid Date 08586252353485738359883907091CU99 Invalid Date
12/15/2019, 12:58:57.247 AM 12/15/2019, 12:58:56.909 AM Invalid Date 08586252353485738359883907091CU99 08586252353485738359883907091CU99 Invalid Date
12/15/2019, 12:23:39.275 AM 12/15/2019, 12:23:38.534 AM Invalid Date 08586252374669322278104928528CU37 12/15/2019, 12:23:39.034 AM
12/15/2019, 12:23:39.172 AM 12/15/2019, 12:23:38.534 AM Invalid Date 08586252374669322278104928528CU37 Invalid Date
12/15/2019, 12:23:39.143 AM 12/15/2019, 12:23:38.550 AM Invalid Date 08586252374669322278104928528CU37 08586252374669322278104928528CU37 Invalid Date
Logic App Suspend? Logs
A Logic App will not sit idle once it is triggered. In your case, the Logic App may have retried one or more actions over some period; the retry policy is configurable up to 1 day. Inspect the Logic App run to check whether it underwent retries.
It seems there is something wrong in your "if condition". Your screenshot shows the "condition" took 0 milliseconds, but it may actually have been cancelled because of an error in the second "if condition". The run history of a Logic App shows 0s in that case, which is misleading.
I did a test on my side; please refer to my logic app below:
In my logic app, I created a "Delay" action that waits 10 minutes, then ran the app manually. It ran the steps before the "Delay" action quickly and then waited at the "Delay" action. Before the 10 minutes elapsed, I used the API to cancel the run, and it shows as cancelled in the "Run history" (see screenshot below).
When I then click that item in "Run history" (I clicked it after 10 minutes; if you click it before the 10 minutes elapse, the condition shows x minutes and a "Running" status), the "condition" shows 0s (same as yours), not 10 minutes or longer.
So in my test the logic app ran for several minutes, yet the "condition" still showed 0s. I therefore think the "condition" in your logic app likewise consumed most of the 38 minutes. Please check the two "if conditions" in your logic app; something wrong in the second condition may have resulted in the cancellation of the first.
I have a string (without double quotes): "2022-12-15 21:23:22 - a123456 (Remarks) 2022-12-15 22:12:22 - a123456 (Remarks) User acknowledgement time"
There are 2 date/time stamps in this string, and I need the one appearing before "User acknowledgement time". I am using the regex (\.*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}.*User acknowledgement time), but it is capturing the very first timestamp in the string; I need the date/time stamp right before "User acknowledgement time". Please help.
(\.*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}.*User acknowledgement time)
I am expecting result as 2022-12-15 22:12:22 - a123456 (Remarks) User acknowledgement time
but I am getting result 2022-12-15 21:23:22 - a123456 (Remarks) User acknowledgement time
I think your group just needs to move. Group 1 is what you are after:
.*(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}).*User acknowledgement time
regex101
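As a quick sketch in Python (using the exact sample string from the question), this shows why moving the group works: the greedy leading .* swallows everything it can, so group 1 ends up capturing the last timestamp before "User acknowledgement time":

```python
import re

s = ("2022-12-15 21:23:22 - a123456 (Remarks) "
     "2022-12-15 22:12:22 - a123456 (Remarks) User acknowledgement time")

# Greedy .* consumes the first timestamp, so the capturing group
# matches the last timestamp that still leaves room for the literal tail.
pattern = r".*(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}).*User acknowledgement time"

m = re.search(pattern, s)
print(m.group(1))  # 2022-12-15 22:12:22
```

If you need the surrounding text as well ("- a123456 (Remarks) ..."), widen the group accordingly; the key point is that the greedy prefix pushes the match to the final timestamp.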
I currently have an alert set up for Data Factory that sends an email if a pipeline runs longer than 120 minutes, following this tutorial: https://www.techtalkcorner.com/long-running-azure-data-factory-pipelines/. When a pipeline does in fact run longer than the expected time, I do receive an alert; however, I am also getting additional, unexpected alerts.
My query looks like:
ADFPipelineRun
| where Status =="InProgress" // Pipeline is in progress
| where RunId !in (( ADFPipelineRun | where Status in ("Succeeded","Failed","Cancelled") | project RunId ) ) // Subquery, pipeline hasn't finished
| where datetime_diff('minute', now(), Start) > 120 // It has been running for more than 120 minutes
I received an alert email on September 28th saying a pipeline was running longer than 120 minutes, but when I try to find that pipeline in the Azure Data Factory pipeline runs, nothing shows up. The alert email has a button that says "View the alert in Azure monitor"; from there I can press "View Query Results" above the shown query. There I can re-enter the query above, filter the date to show all pipelines running longer than 120 minutes since September 27th, and it returns 3 pipelines.
Something I noticed about these pipelines is the end time column:
I'm thinking that at some point the UTC time is not properly configured and for that reason, maybe the alert is triggered? Is there something I am doing wrong, or a better way to do this to avoid a bunch of false alarms?
To create pre-emptive warnings for long-running jobs:
1. Create the activity.
2. Click on a blank space of the pipeline canvas.
3. Follow the path: Settings > Elapsed time metric.
Refer to Operationalize Data Pipelines - Azure Data Factory.
I'm not sure if you're seeing false alerts. What you've shown here looks like the correct behavior.
You need to keep in mind:
Duration threshold should be offset by the time it takes for the logs to appear in Azure Monitor.
The email alert takes you to the query that triggered the event. Your query only shows "InProgress" statuses, so the End property is not set/updated. You'll need to extend your query to look at one of the other statuses to see the actual duration.
Run another query with the RunId of the suspect runs to inspect the durations.
ADFPipelineRun
| where RunId == 'bf461c8b-0b1e-43c4-9cdf-7d9f7ccc6f06'
| distinct TimeGenerated, OperationName, RunId, Start, End, Status
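As a sketch of that follow-up (DurationMin is a name introduced here for illustration, not part of the schema), you could also compute real durations directly from the terminal-status records:

```kusto
ADFPipelineRun
| where Status in ("Succeeded", "Failed", "Cancelled")    // terminal statuses carry a real End
| extend DurationMin = datetime_diff('minute', End, Start)
| where DurationMin > 120                                 // runs that truly exceeded 2 hours
| project RunId, PipelineName, Start, End, Status, DurationMin
```

This avoids the misleading End values you see on "InProgress" rows, since End is only meaningful once the run has finished.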
So, I have monitoring on an error log file (mtail). It simply counts the number of error lines; mtail sums the number of new matching lines in the file.
I want to send an alert at most once per 10 minutes when new errors have occurred, not one for every single error.
Please, can you provide exact values for these lines:
expr: increase(php_fpm_errors_total[10m]) > 0
for: 10m
I would appreciate if you provide me some doc links or explanation.
The way you have it, the rule only fires if new errors are present at every evaluation (default interval = 1m) for 10 consecutive minutes, and only then triggers an alert. There is also an Alertmanager setting called group_wait (default = 30s): after the first alert fires, it waits and groups all alerts triggered in that window into one notification. You can remove the for: 10m and set group_wait: 10m if you want a notification even for a single error but don't want 1000 notifications, one per error.
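As a hedged sketch of that combination (assuming a standard Prometheus + Alertmanager setup; the group name, alert name, and email receiver are placeholders):

```yaml
# Prometheus rule file: no "for:" clause, so a single new error is
# enough to put the alert into firing state.
groups:
  - name: php-fpm-errors
    rules:
      - alert: PhpFpmErrors
        expr: increase(php_fpm_errors_total[10m]) > 0
        annotations:
          summary: "New PHP-FPM errors in the last 10 minutes"

# alertmanager.yml: batch alerts for 10 minutes into one notification.
route:
  receiver: email          # placeholder receiver name
  group_by: ['alertname']
  group_wait: 10m          # wait 10m after the first alert before notifying
  group_interval: 10m      # subsequent batches also at most every 10m
```

With this arrangement you still learn about a single error, but at most one notification goes out per 10-minute grouping window.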
There is an issue with the current Heartbeat query we are using. The query itself works; however, I have observed a problem while setting up the alert.
Breakdown of the query:
Type=Heartbeat Computer in $ComputerGroups[NON-PROD_Group] |
measure max(TimeGenerated) as LastCall by Computer | where LastCall < NOW-5MINUTE
Query checks for Heartbeat of computer in the group ‘NON-PROD_Group’
Measure max(TimeGenerated) as LastCall by Computer: checks the time of the last Heartbeat from each server and assigns that value to 'LastCall'. 'LastCall' now holds the time of the last heartbeat.
Where LastCall < NOW-5MINUTE: checks whether the last heartbeat was more than 5 minutes before 'NOW'. The alert is triggered based on that.
I have set the TIME WINDOW to 24 hours. The issue is that alerts are generated for all occurrences between 'NOW-5MINUTE' and 24 hours ago; no alerts are generated if LastCall falls outside the time window.
If the server is down for more than a day, no alerts will be generated.
For instance, if one of the servers goes down Friday evening, alert notifications will come in until Saturday evening (24 hours is the maximum time allowed); after that the alert clears and no more notifications are generated.
By Monday morning, the alert will have cleared and will report that everything is working fine.
Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer, OSType, ResourceGroup, Resource, ResourceProvider, ResourceType
| where LastHeartbeat < ago(5m)
// If you are looking for any specific resource group, you may add in where condition like ResourceGroup == 'ecom'
We are getting an error when we try to set the application pool to recycle every 7 days at a specific time. The doc says this is possible using the optional [d] argument. We want to recycle every 7 days at 3 AM.
http://technet.microsoft.com/en-us/library/cc754494(v=ws.10).aspx
Command :
C:\Windows\System32\inetsrv>appcmd set apppool /apppool.name:TempPool /+recycling.periodicRestart.schedule.[value='7.03:00:00']
Error Message:
Application Pools
There was an error while performing this operation.
Details:
Timespan value must be between 00:00:00 and 23:59:59 seconds inclusive, with a granularity of 60 seconds
Although this question is a bit old, I faced the same issue yesterday while writing some C# code to manipulate an application pool programmatically.
I found the sample for schedules in the doc at the following link, which reads "adding an application pool... then set the application pool to daily recycle at 3:00 A.M.". This means we cannot specify a fixed time span for recycling by adding a schedule; a schedule entry is a time of day, not an interval.
http://www.iis.net/configreference/system.applicationhost/applicationpools/add/recycling/periodicrestart/schedule/add#006
That's why it throws the exception asking for a time span under 23:59:59.
When you want to specify a fixed time span (interval) for recycling, you should use the time property at the periodicRestart level instead.
See this doc for samples of the various ways to meet your requirement.
http://www.iis.net/configreference/system.applicationhost/applicationpools/add/recycling/periodicrestart#005
// add schedule to recycle at 3 am every day
appPool.Recycling.PeriodicRestart.Schedule.Clear();
appPool.Recycling.PeriodicRestart.Schedule.Add(new TimeSpan(3, 0, 0));
// set to recycle every 3 hours
appPool.Recycling.PeriodicRestart.Time = new TimeSpan(3, 0, 0);
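For the appcmd route, a hedged sketch along the same lines (assuming the same TempPool pool): schedule entries accept only a time of day, while a multi-day interval goes through recycling.periodicRestart.time, which does accept the d.hh:mm:ss form. Note the interval variant recycles 7 days after each restart, not at a fixed 3 AM clock time.

```bat
:: daily recycle at 3:00 AM (time of day only; no [d] prefix allowed)
appcmd set apppool /apppool.name:TempPool /+recycling.periodicRestart.schedule.[value='03:00:00']

:: or: recycle on a fixed 7-day interval (an interval, not a clock time)
appcmd set apppool /apppool.name:TempPool /recycling.periodicRestart.time:7.00:00:00
```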