Azure Machine Learning - Every pipeline run is canceled automatically after 30 minutes - azure

For the past 5 days, every pipeline run has been canceled (not failed) automatically after roughly 30 minutes.
The pipeline stage shows the error message:
Response status code does not indicate success: 403 (Identity does not have permissions for Microsoft.MachineLearningServices/workspaces/experiments/runs).
Microsoft.RelInfra.Common.Exceptions.ErrorResponseException: Identity does not have permissions for Microsoft.MachineLearningServices/workspaces/experiments/runs/read actions.
I verified that my user has the Owner and Contributor roles, so all read and write access should be in place.
I created a completely new Machine Learning resource and tried with different users (Contributor role).
The error message does not make sense: the pipeline starts and runs for 30 minutes (so the user permissions are apparently in place), and only then is it canceled automatically. If I re-run the pipeline step from ML studio, the step succeeds.
I have been working with Azure Machine Learning for 4 months, and everything worked fine until now.
The pipeline is created with the Python SDK from a local machine.
The subscription is a paid one.
EDIT:
The 70_driver_log log does not show any error message; it stops after printing some of my own messages.
70_driver_log.txt:
7%|█ | 10388/139445 [12:55<2:00:20, 17.87it/s]
7%|█ | 10390/139445 [12:55<3:04:23, 11.67it/s]
executionlogs.txt:
[2021-05-25 12:00:47Z] Job is running, job runstatus is Running
[2021-05-25 12:02:49Z] Cancelling the job
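When comparing several canceled runs, it can help to measure how long the job stayed in a given state before the cancel request appeared. The following is a minimal sketch that assumes the bracketed UTC timestamp format shown in the executionlogs.txt excerpt above; the function names are hypothetical helpers, not part of any Azure SDK. Note that the excerpt only shows the last status line before cancellation, so the result is the gap between those two lines, not the total run time.

```python
from datetime import datetime


def parse_log_line(line):
    """Split '[2021-05-25 12:00:47Z] message' into (datetime, message)."""
    stamp, _, message = line.partition("] ")
    # Drop the opening bracket and the trailing 'Z' (UTC marker) before parsing.
    when = datetime.strptime(stamp.lstrip("[").rstrip("Z"), "%Y-%m-%d %H:%M:%S")
    return when, message


def seconds_between(lines, start_text, end_text):
    """Seconds from the first line containing start_text to the next line
    containing end_text, or None if either marker is missing."""
    start = end = None
    for line in lines:
        when, message = parse_log_line(line)
        if start is None and start_text in message:
            start = when
        elif start is not None and end_text in message:
            end = when
            break
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# The two lines quoted in the question.
log_lines = [
    "[2021-05-25 12:00:47Z] Job is running, job runstatus is Running",
    "[2021-05-25 12:02:49Z] Cancelling the job",
]
elapsed = seconds_between(log_lines, "Job is running", "Cancelling")
print(elapsed)  # prints 122.0
```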

Related

Azure Function Error: The operation has timed out

An error has started popping up in my Azure Data Factory Pipeline. I have a few Azure Function steps in the pipeline, but for some reason, one of the Azure Function steps has started returning an error. In Azure Data Factory, the error is a 3608 code after running for 1 minute 40 seconds:
Failure type: User configuration issue
Details: Call to provided Azure function 'CollateSheetsHTTPTrigger' failed with status-'InternalServerError' and message - 'Invoking Azure function failed with HttpStatusCode - InternalServerError.'.
However, in a prior run of a sub-pipeline, this Azure Function ran successfully on the same data (the parameters and the worksheet are the only differences). The subsequent 3 pipeline runs fail immediately (after 2 seconds) at the first Azure Function step in each (a different AZ function now), with the same 3608 error code but different details:
Call to provided Azure function '???????????????' failed with status-'NotFound'
and message - '<html> <head><title>404 Not Found</title></head> <body
bgcolor="white"> <center><h1>404 Not Found</h1></center> <hr><center>nginx</center>
</body> </html> '.
Now it gets even stranger. After these 3 failed pipelines, the next pipeline which is pretty much the same as the previous 4 except for a few parameters, runs successfully, even though it has the same 2 AZ functions that failed before. And then the next 2 pretty similar pipelines also run successfully.
I then went and looked at the monitoring page for the 2 Azure Functions:
The first AZ function that failed, had 2 errors even though it only failed once in AZ Data Factory... the timing is slightly different for the 2 errors but they could only come from the first failed pipeline, so why does it say there are 2 errors? Then if you look at the actual error, all it says is "The operation timed out". The function was not running for more than 150 seconds so this is strange. Additionally, I have a bunch of error catching code and nothing comes up there.
The failed steps of the other AZ function do not show up on the monitoring page. It seems as if the first error crashed the AZ function app, which then eventually restarted?
I'm sorry I can't help, but I had a similar problem with an Azure function that executes a SOAP call to a web service every minute. For the past 4 days this has also been failing with a timeout. If I run the function in my debugger, it runs without problems, but the Azure Function fails every time after 20 seconds.
I'll follow this question and hope someone else can help...
An Azure Support Engineer identified the issue: it was due to a change to the azure-function-host library. The relevant issue is here https://protect-eu.mimecast.com/s/-CT5C3QxrTmREBwhgPxLm?domain=github.com and was fixed last week.

Azure Data Factory pipeline fails when run by a scheduled trigger

I have created 2 pipelines in Azure Data Factory. Each pipeline uses a custom activity that runs a Python script. When a pipeline is executed manually, it runs successfully any number of times. However, I have also created a scheduled trigger with an interval of 15 minutes to run the 2 pipelines. The first execution succeeds, but at the next interval I get the error "Operation on target PyScript failed: Hit unexpected exception and execution failed." We are blocked by this. Any input would be really helpful.
The ADF troubleshooting guide states:
Custom Activity:
The following table applies to Azure Batch.
Error code: 2500
Message: Hit unexpected exception and execution failed.
Cause: Can't launch command, or the program returned an error code.
Recommendation: Ensure that the executable file exists. If the program started, make sure stdout.txt and stderr.txt were uploaded to the storage account. It's a good practice to emit copious logs in your code for debugging.
Related helpful doc: Tutorial: Run Python scripts through Azure Data Factory using Azure Batch
Hope this helps.
If you are still blocked, please share failed pipeline run ID & failed activity run ID, for further analysis.
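To make the recommendation about logs and error codes concrete, here is a minimal sketch of how a Python script run by a custom activity could write informational messages to stdout, warnings and errors to stderr, and report failure through its exit code. The logger name `pyscript` and the function `do_work` are hypothetical placeholders, and the sketch assumes that whatever the script writes to its standard streams is what ends up in stdout.txt and stderr.txt.

```python
import logging
import sys


def configure_logging():
    """Route INFO and below to stdout, WARNING and above to stderr."""
    logger = logging.getLogger("pyscript")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    out = logging.StreamHandler(sys.stdout)
    out.addFilter(lambda record: record.levelno < logging.WARNING)
    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.WARNING)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (out, err):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


def run(work, logger):
    """Call work() and turn any exception into a nonzero exit code."""
    logger.info("starting")
    try:
        work()
    except Exception:
        # logger.exception also writes the traceback to stderr.
        logger.exception("unhandled error")
        return 1
    logger.info("finished")
    return 0


def do_work():
    # Hypothetical placeholder for the real script body.
    pass


def main():
    return run(do_work, configure_logging())

# In the real script, end with: sys.exit(main())
```

With this shape, a run that fails only under the trigger should at least leave a traceback in the captured stderr, which narrows down whether the command failed to launch or the program itself returned an error code.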

Error deploying webjob schedule - Response status code does not indicate success: 409 (Conflict)

I have a project with three scheduled webjobs. They all deploy correctly from Visual Studio, but it can't create a schedule for the third one. I get the following error:
webjobs.console.targets(110,5): Error : An error occurred while
creating the WebJob schedule: Response status code does not indicate
success: 409 (Conflict).
There's nothing special about my schedule in webjob-publish-settings.json:
{
"$schema": "http://schemastore.org/schemas/json/webjob-publish-settings.json",
"webJobName": "...",
"startTime": "2015-12-07T00:00:00-05:00",
"endTime": null,
"jobRecurrenceFrequency": "Day",
"interval": 1,
"runMode": "Scheduled"
}
I tried adding the schedule manually from the Azure portal and got a bit more information.
Job collection 'WebJobs-EastUS' reaches maximum number of jobs
allowed.
It turns out that you can only have 5 jobs per collection. This project has 3 jobs and two environments, so there are 6 in total. I created a new job schedule in a new collection, then deleted the job, and tried redeploying to see if it used the new empty collection. It did not, and I got the same error.
Next, I deleted a job in the original collection and redeployed. That time it worked fine. This isn't an ideal solution, since I'm still limited to 5 jobs when I need 6.
Is there a way to specify the job collection to use for the scheduler? Or is there something else I'm missing?
You can manage the scale of the Scheduler JobCollection used by your WebJobs in the old portal. Navigate to Scheduler/JobCollections and increase the scale on your Scheduler JobCollection to increase your job limit. This blog post shows where to find this stuff in the portal, and also details how WebJobs + Azure Scheduler work behind the scenes.
However, we highly recommend using the new inbuilt scheduling mechanism detailed in this blog post. This mechanism keeps the schedule with your job and involves no outside dependencies.

What is SnapshotHelper on Windows Azure

Occasionally (rarely) my Azure website freezes and eventually returns 502 errors (it seems to take about 5 minutes). I see a whole bunch of items in my trace log related to 'snapshot helper'.
I haven't explicitly used this, and Google only seems to return results for VMware. Does anyone know what this is? If it is Azure taking a backup of my site, is there some way I can schedule it for a quieter time than 11:00 am EST?
SnapshotHelper::TakeSnapshotInternal - no new files in CodeGen
SnapshotHelper::TakeSnapshot time since last: 01:19:59.9600775
SnapshotHelper::RestoreSnapshotInternal SUCCESS - File.Copy
SnapshotHelper::RestoreSnapshotInternal SUCCESS - process
SnapshotHelper::TakeSnapshotTimerCallback
SnapshotHelper::TakeSnapshotInternal - no new files in CodeGen
SnapshotHelper::TakeSnapshotTimerCallback
SnapshotHelper::TakeSnapshotInternal - no new files in CodeGen
SnapshotHelper::TakeSnapshot time since last: 00:19:59.9866142
SnapshotHelper::TakeSnapshotTimerCallback
SnapshotHelper::TakeSnapshotInternal - no new files in CodeGen
I don't think you have control over when backups occur, but you can add a second redundant VM so that users will hit this one if the first is unavailable.

Quick Deployment Job failing when no item in Quick deploy item list

I have a content deployment job from one server to another. The content deployment job works fine, but when I turn on the Quick Deploy job, it starts logging errors in the system event log.
I later figured out that Quick Deploy works fine if there is at least one item in the Quick Deploy items list; otherwise it gives me the error.
In the Quick Deploy settings I set it to run every 30 minutes, so I get the error every 30 minutes in the system event log.
The Execute method of job definition Microsoft.SharePoint.Publishing.Administration.ContentDeploymentJobDefinition (ID daa20dd3-f6ad-4e27-923a-1ebf26c71723) threw an exception. More information is included below.
ContentDeploymentJobReport with ID '{00000000-0000-0000-0000-000000000000}' was not found.
Parameter name: jobReportId
and
Publishing: Content deployment job failed. Error: 'System.ArgumentOutOfRangeException: ContentDeploymentJobReport with ID '{00000000-0000-0000-0000-000000000000}' was not found.
Parameter name: jobReportId
at Microsoft.SharePoint.Publishing.Administration.ContentDeploymentJobReport.GetInstance(Guid jobReportId)
at Microsoft.SharePoint.Publishing.Administration.ContentDeploymentJob.get_LastReport()
at Microsoft.SharePoint.Publishing.Administration.ContentDeploymentJob.get_SQMDeploymentJobFlags()
at Microsoft.SharePoint.Publishing.Administration.ContentDeploymentJob.
I found out that if there is any item in the Quick Deploy list in our authoring environment that needs to go to Production, we do not get this error. As a workaround for now, I deploy any page after setting up my content deployment jobs; that way I do not hit this issue.
