Call Azure ML Batch Job via Azure Data Factory

I have scheduled an Azure ML batch job via Azure Data Factory to run daily at 12:00 AM UTC.
I don't know what the issue is, but it fails on the 3rd day of every month; otherwise it runs perfectly.
Is anybody else facing the same issue?
(Screenshots of the failed runs for September and October.)

It looks like ADF is successfully invoking ML, and ML is reporting back the "not converging" error. Could something specific in the input data be causing this? Is there anything in the ML model that handles dates at a monthly granularity and could be affected by daily execution around the start or end of the month (especially if the data arrives with an offset or delay)?

It is likely data related. The error is returned by the batch execution system when the model tries to score the data. I would look for duplicate Ids being inserted, or for any specific data being passed that could cause problems for this model (see the sketch below).
Having the job id and the Azure region this is running in would help us look up the specific error.
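For that duplicate-Id check, a minimal sketch along these lines may help; the file name scoring-input.csv and the assumption that the Id is the first CSV column are hypothetical and would need to match the actual batch scoring input:

using System;
using System.IO;
using System.Linq;

class DuplicateIdCheck
{
    static void Main()
    {
        // Hypothetical layout: a CSV whose first column is the record Id.
        // Adjust the path and the column index to match the real scoring input.
        var duplicates = File.ReadLines("scoring-input.csv")
            .Skip(1)                             // skip the header row
            .Select(line => line.Split(',')[0])  // take the Id column
            .GroupBy(id => id)
            .Where(g => g.Count() > 1);

        foreach (var group in duplicates)
            Console.WriteLine($"Id {group.Key} appears {group.Count()} times");
    }
}

Running this against the input from a failing 3rd-of-month run and from a successful day side by side should show whether duplicates are the trigger.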

Related

Why is the change feed lag estimator showing lag in the millions?

I am working with the Cosmos DB change feed on a real-time project. We run our webjobs in an Azure App Service on the P3V2 tier, and multiple webjobs consume the change feed. To monitor these processes we use the change feed estimator to track record lag, implemented according to the following document:
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-use-change-feed-estimator
In the .NET Core code of one of the webjobs we introduced a 10-minute delay using await Task.Delay(). For that specific webjob the estimator reports lag in the millions, even though we are processing no more than 100 records.
This behavior seems erratic. Can anyone help find the exact reason?
Is the estimator matched to a processor that is currently running and processing documents? Normally what you describe matches a scenario where the processor is not running, never ran, or never completed a successful run on some of the leases.
You can use the detailed estimation to understand how the lag is distributed across the leases: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/how-to-use-change-feed-estimator#as-an-on-demand-detailed-estimation
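As a concrete way to run that detailed estimation, here is a minimal sketch using the .NET SDK's ChangeFeedEstimator, following the linked document; the processor name "myProcessor" is a placeholder and must match the name the webjob's processor was started with:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

class EstimatorCheck
{
    static async Task PrintLagPerLeaseAsync(Container monitoredContainer, Container leaseContainer)
    {
        // "myProcessor" is a placeholder; the lease container must be the
        // same one the Change Feed Processor writes its leases to.
        ChangeFeedEstimator estimator = monitoredContainer.GetChangeFeedEstimator(
            processorName: "myProcessor",
            leaseContainer: leaseContainer);

        using FeedIterator<ChangeFeedProcessorState> iterator = estimator.GetCurrentStateIterator();
        while (iterator.HasMoreResults)
        {
            FeedResponse<ChangeFeedProcessorState> states = await iterator.ReadNextAsync();
            foreach (ChangeFeedProcessorState state in states)
            {
                // A lease with a large EstimatedLag and no owner is one that no
                // running processor instance has picked up or checkpointed.
                Console.WriteLine($"Lease {state.LeaseToken}: lag {state.EstimatedLag}, owner '{state.InstanceName}'");
            }
        }
    }
}

One possible explanation for the webjob with the 10-minute await Task.Delay(): the lease is only checkpointed after the handler delegate completes, so during that delay the estimator keeps measuring from the last successful checkpoint.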

ADF Mapping Data Flows failing with BatchUpdateException

I have a number of Mapping Data flows that have been running regularly for the past several months, and some of them started failing yesterday.
The data flow pattern is:
Source: 2 Azure SQL DB tables, a lookup table in Synapse
Sink: 1 table in Synapse (Azure SQL DB)
We have enabled PolyBase staging for better performance, as each activity takes too long without it, and have a linked service to an Azure Blob Storage account for this.
Last night's run failed midway for some of our larger tables with the following error, but the smaller tables were all successful. Nothing has changed on any of these pipelines, or on any of the linked services in several months.
Going into debug mode, I can't look at the data preview for any of the Synapse sink activities unless I disable the 'Staging' option in settings. If I try with staging enabled, it says "Blob storage staging properties should be specified", even though I have entered those properties in the debug settings, so I still get the error.
The strange thing is that this problem only occurs on the data flows moving larger amounts of data; the smaller tables are fine in debug mode as well. All these data flows were successful 2 days ago, so could this be a space issue in Blob Storage?
The pipeline activity error code:
{"StatusCode":"DFExecutorUserError",
"Message":"Job failed due to reason: at Sink 'SinkIntoSynapse':
java.sql.BatchUpdateException: There are no batches in the input script.",
"Details":"at Sink 'SinkIntoSynapse':
java.sql.BatchUpdateException: There are no batches in the input script."}
I have seen this caused by a commented-out SQL statement in the Pre-copy script section of the sink settings.
If you have anything in the Pre-copy script section, try removing it before publishing and running the data factory again.
I can confirm what Kevin said: in my case I had started writing a SQL script, and even after I cancelled it I was still getting the error.
Try clicking the Recycle Bin icon, shown in the screenshot.
Worked for me.

ADFv2 Queue time

I have a pipeline with a few copy activities. Some of those activities copy large amounts of data from a storage account to the same storage account, but in compressed form (I'm talking about a few TB of data).
After running the pipeline for a few hours I noticed that some activities show "Queue" time on the monitoring blade, and I was wondering what the reason for that "Queue" time could be. More importantly, am I being billed for that time? From what I understand, my ADF is not doing anything during it.
Can someone shed some light? :)
(Posting this as an answer because of the comment character limit.)
After a long discussion with Azure Support and after reaching out to someone on the ADF product team, I got some answers:
1 - The queue time is not billed.
2 - Initially, the ADF orchestration system puts the job in a queue, and it accrues "queue time" until the infrastructure picks it up and starts processing.
3 - In my case the queue time kept increasing after the job started because of a bug in the underlying backend executor (it uses Azure Batch). Apparently the executors were crashing, and my job was suffering from "re-pickup" time, which increased the queue time. This explained why, after some time, I started to see the execution time and the transferred data decreasing. The ETA for the bugfix is the end of the month. Additionally, the job I was executing timed out (after 7 days), and after checking the billing I confirmed that I wasn't charged a dime for it.
Based on the chart in the ADF Monitor, you can find the same metrics in the example.
In fact, these are the metrics in the executionDetails parameter: Queue Time + Transfer Time = Duration Time.
More details on the stages the copy activity goes through, and the corresponding steps, duration, used configurations, etc. It's not recommended to parse this section as it may change.
Please refer to Parallel Copy: the copy activity creates parallel tasks to transfer data internally. Activities are in an active state during both queue time and transfer time; they never stop during queue time, so they are billed for the whole duration. I think this is an inevitable loss in the data transfer process and is absorbed by ADF internally. You could try adjusting the parallelCopies parameter to see if anything changes; a way to eyeball the queue/transfer split is sketched below.
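To see how a run splits between the two, here is a small sketch that pulls the split out of a copy activity's output JSON; the executionDetails shape shown here (detailedDurations with queuingDuration and transferDuration, in seconds) is modeled on what the monitoring blade returns, and, as the quoted note says, it is not a stable contract, so treat this as ad-hoc inspection only:

using System;
using System.Text.Json;

class QueueTimeBreakdown
{
    static void Main()
    {
        // Illustrative copy-activity output; the property names are an
        // assumption based on the monitoring blade and may change.
        string activityOutput = @"{
          ""executionDetails"": [ {
            ""usedParallelCopies"": 4,
            ""detailedDurations"": { ""queuingDuration"": 205, ""transferDuration"": 940 }
          } ]
        }";

        using JsonDocument doc = JsonDocument.Parse(activityOutput);
        JsonElement durations = doc.RootElement
            .GetProperty("executionDetails")[0]
            .GetProperty("detailedDurations");

        double queue = durations.GetProperty("queuingDuration").GetDouble();
        double transfer = durations.GetProperty("transferDuration").GetDouble();

        // Queue Time + Transfer Time = Duration Time.
        Console.WriteLine($"Queued {queue}s of {queue + transfer}s total ({queue / (queue + transfer):P0})");
    }
}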
If you are concerned about the cost, you could submit feedback here to ask for a statement from the Azure team.

Slow AAD Differential Query

The Azure AD differential query works well and fast when we query the difference between the current Azure AD state and a previous state no older than 30-60 minutes. But when we query against a state from a week or a month ago, it takes 10 minutes to return the changes, even if the directory is small and only 3-4 attributes changed in that period, which is very slow. Is this expected behavior? Are there any workarounds?
Based on my test, this is not expected behavior. My first differential query request was made on 10/10/2016, and today a test of the differential query REST call using Fiddler took about 30 seconds.
To narrow down this issue, I suggest calling the service from a different network to make sure the problem is not caused by the network. Testing other Azure AD Graph REST endpoints is also recommended, to see whether the issue is specific to Azure Active Directory.
It's certainly not expected... I can query 21K users in 3-4 minutes over a 24 Mbit DSL line with partial properties (only those I want), and in less than 10 minutes for all properties (and the objects have almost all properties set, so deserialization is fully in effect).
Delta queries take a few seconds, always.
Are you using your own routine over a basic HTTP client, or are you using the classes provided in the MS.Azure.AD assembly? (A bare-bones HTTP-client harness is sketched below for comparison.)
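For that comparison, a minimal timing harness over a plain HttpClient might look like the sketch below; the tenant contoso.onmicrosoft.com and the bearer token are placeholders, and the exact parameters for an initial differential sync should be checked against the Azure AD Graph differential query documentation:

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class DeltaQueryTiming
{
    static async Task Main()
    {
        // Placeholder tenant and token; acquiring the token is out of scope here.
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", "<access-token>");

        // First sync request; later requests would instead use the
        // deltaLink/nextLink URL returned in the previous response body.
        string url = "https://graph.windows.net/contoso.onmicrosoft.com/users?api-version=1.6";

        Stopwatch watch = Stopwatch.StartNew();
        HttpResponseMessage response = await client.GetAsync(url);
        string body = await response.Content.ReadAsStringAsync();
        watch.Stop();

        Console.WriteLine($"{(int)response.StatusCode} after {watch.ElapsedMilliseconds} ms, {body.Length} chars");
    }
}

If the bare client shows the same 10-minute response time, the slowness is on the service side rather than in local deserialization.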

Diagnosing errors in StreamAnalytics Jobs

I've got a series of services that generate events which are written to an Azure Event Hub. The hub is connected to a Stream Analytics job that takes the event information and writes it to Azure Table Storage and a Data Lake Store for later analysis by different teams and tools.
One of my services is reporting all events correctly, but the other isn't. After hooking up a listener to the hub I can see its events are being sent without a problem, but they aren't being processed or sent to the sinks by the job.
In the audit logs I see periodic transformation errors for one of the columns that's written to the storage, but looking at the data there's no problem with its format, and I can't seem to find a way to inspect the troubled events that are causing these failures.
The only error I see on the Management Services is
We are experiencing issues writing output for output TSEventStore right now. We will try again soon.
It sounds like there may be two issues:
1) The writing to the TableStorage TSEventStore is failing.
2) There are some data conversion errors.
I would suggest troubleshooting them one at a time. For the first one, are any events being written to the TSEventStore? Is there another message in the operations logs that might give more detail on why writing is failing?
For the second one, today we don't have a way to output the events that have data conversion errors. The best way is to output the data to only one sink (Data Lake) and look at it there.
Thanks,
Kati
