Is there a way to stop Azure ML throwing an error when exporting zero lines of data? - azure

I am currently developing an Azure ML pipeline that, as one of its outputs, maintains a SQL table holding all of the unique items that have been fed into it. There is no way to know in advance whether the data fed into the pipeline consists of new unique items or repeats of previous items, so before updating the table it pulls the data already in that table and drops any incoming items that already appear there.
However, because of this self-reference there are cases where zero new items are found, and as a result there is nothing to export to the SQL table. When this happens Azure ML throws an error, because exporting zero lines of data is treated as a failure. In my case, however, this is expected behaviour and absolutely fine.
Is there any way for me to suppress this error, so that when it has zero lines of data to export it just skips the export module and moves on?
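For concreteness, the dedup-then-maybe-export logic described here looks roughly like this (a minimal pandas sketch; the item_id column name is made up):

import pandas as pd

# Hypothetical stand-ins for the pipeline's incoming data and the maintained SQL table.
incoming = pd.DataFrame({"item_id": [1, 2, 3]})
existing = pd.DataFrame({"item_id": [1, 2, 3]})  # this run, everything is a repeat

# Anti-join: keep only the items not already present in the maintained table.
new_items = incoming[~incoming["item_id"].isin(existing["item_id"])]

if new_items.empty:
    print("No new unique items - skipping the export")
else:
    print(f"Exporting {len(new_items)} new items")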

It sounds as if you are struggling to orchestrate a data pipeline because the orchestration is happening in two places. My advice would be to either move more of the orchestration into Azure ML, or make the separation between the two greater. One way to do this would be to have a regular export to blob of the table you want to use as training data. Then you can use a Logic App to trigger a pipeline whenever a non-empty blob lands in that location.
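The "only trigger when the blob is non-empty" gate might be sketched like this, for example inside an Azure Function playing the Logic App's role (a minimal sketch; the connection string, container, and blob name are placeholders):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="exports",
    blob_name="training-data.csv",
)

if blob.get_blob_properties().size > 0:
    # A non-empty export landed: trigger the training pipeline here,
    # for example by calling the pipeline's published REST endpoint.
    print("Non-empty blob - trigger the pipeline")
else:
    print("Empty export - skip this run")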

This issue has been resolved by an update to Azure Machine Learning: you can now run pipelines with a "continue on step failure" flag set, which means that steps following the failed data export will continue to run.
This does mean you will need to design your pipeline so that its downstream modules can handle upstream failures; this must be done very carefully.
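A minimal sketch of setting that flag from code, assuming the v1 azureml-pipeline-core SDK (the step objects are placeholders for the pipeline's existing steps):

from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()

# prep_step and export_step stand in for the pipeline's existing, already-defined steps.
pipeline = Pipeline(workspace=ws, steps=[prep_step, export_step])

# continue_on_step_failure lets the steps after a failed export (for example one
# that received zero rows) keep running instead of failing the whole pipeline.
run = Experiment(ws, "dedup-and-export").submit(
    pipeline,
    continue_on_step_failure=True,
)
run.wait_for_completion(show_output=True)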

Related

Azure Monitor Workbook - run ARM query per each resource

I'm looking into Azure Workbooks now, and I'm wondering if the below is actually even possible.
Scenario:
1. List all function apps under the Subscription / Resource Group scope. This step is done: a simple Azure Resource Graph query with parameters does the job.
2. Execute an Azure Resource Manager action against each Function App returned by the query in step 1. Specifically, I'm interested in the detectors/functionExecutionErrors ARM API, and I want to return a parsed result from it. When doing that for a hardcoded resource, I can get the results I need. Using the JSON path $.properties.dataset[0].table.rows[0][1] I get back the summary: All running functions in healthy state with execution failure rate less than 0.1%.
I realize this might either be undoable in Workbooks or something trivial that I missed; it would be easiest if I could just run 'calculated columns' when rendering outputs. So, the summarizing question is:
How, if possible, can I combine an Azure Resource Graph query with the Azure Resource Manager data source, so that the Azure Resource Manager query runs for each returned Graph resource and the results are displayed as a table in the form "Resource ID | ARM api results"?
I think I have achieved the closest result to this by marking the Resource Graph query output as a parameter (id -> FunctionAppId) and referencing that in the ARM query as /{FunctionAppId}/detectors/functionExecutionErrors. This works fine as long as only one resource is selected, but there are two obstacles: I want to execute against all query results regardless of whether they are selected, and I need Azure Resource Manager to understand that it should loop over the resources rather than concatenate them (as seen in the invoked HTTP call in the F12 dev tools, the resource names are just joined together).
Hopefully there's someone out there who could help out with this. Thanks! :-)
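For reference, the per-resource detector call and the JSON-path extraction described above look roughly like this when scripted outside the Workbook (a minimal sketch; the api-version value and the use of the requests and azure-identity packages are assumptions):

import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

# Resource IDs as returned by the Resource Graph query (placeholder shown here).
function_app_ids = [
    "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Web/sites/<app>",
]

for app_id in function_app_ids:
    url = (
        f"https://management.azure.com{app_id}"
        "/detectors/functionExecutionErrors?api-version=2021-03-01"  # api-version is an assumption
    )
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    data = resp.json()
    # Equivalent of the JSON path $.properties.dataset[0].table.rows[0][1]
    summary = data["properties"]["dataset"][0]["table"]["rows"][0][1]
    print(f"{app_id} | {summary}")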
I'm also new to Workbooks and I think creating a parameter first with the functionId is best. I do the same ;)
With multiple functions the parameter will have them all. You can use split() to get an array and then loop.
Will that work for you?
Can you share your solution if you managed to solve this differently?
cloudsma.com is a resource I use a lot to understand the queries and options better. Like this one: https://www.cloudsma.com/2022/03/azure-monitor-alert-reports
Workbooks doesn't currently have the ability to run the ARM data source against many resources, though it is on our backlog and we are actively investigating a way to run any data source for a set of values and merge the results together.
The general workaround is to do as stated: either use a parameter to select a resource and run that one query for the selected item, or do something similar with a query step with a grid, and have the selection in the grid output a parameter that is used as input to the ARM query step.

Custom Validation error reporting in Data Factory

I'm using Azure Data Factory to build some file-to-database imports, and one of the requirements is that if a file isn't valid (e.g. a column is missing, or contains incorrect data such as a wrong data type or a lookup value that doesn't exist in the database) then an alert is sent detailing the errors. Errors should be human readable, so rather than a SQL error saying the insert would violate a foreign key, it should say that an incorrect value was entered for x.
This doc (https://learn.microsoft.com/en-us/azure/data-factory/how-to-data-flow-error-rows) describes a way of using conditional splits to add custom validation, which would certainly work to let me import the good data and write the bad data to another file with custom error messages. But how can I then trigger an alert from this? As far as I can tell, the data flow will still report success, and something like calling a Logic App to send an email needs to be done in the pipeline rather than in the data flow.
That’s a good point, but couldn’t you write the bad records to an error table or file, then produce an aggregated summary of how many records erred and counts of the specific errors, and pass that to a Logic App or the SendGrid API to alert interested parties of the status? It would be a post-data-flow completion activity that checks whether there is an error file or error records in the table; if so, aggregate and classify them, then alert.
I have a similar notification process that gives me successful/erred pipeline notifications, as well as 30 day pipeline statistics... % pipeline successful, average duration, etc.
I’m not at my computer right now, otherwise I’d give more detail with examples.
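A rough sketch of that post-data-flow check, assuming an Azure Function that reads the error table with pyodbc and posts a summary to a Logic App HTTP trigger (the connection string, table name, and URL are placeholders):

import pyodbc
import requests

conn = pyodbc.connect("<sql-connection-string>")
rows = conn.execute(
    "SELECT error_message, COUNT(*) AS error_count "
    "FROM etl.import_errors "
    "WHERE import_date = CAST(GETDATE() AS date) "
    "GROUP BY error_message"
).fetchall()

if rows:
    # Aggregate and classify, then hand off to the alerting endpoint.
    summary = [{"error": r.error_message, "count": r.error_count} for r in rows]
    requests.post("https://<logic-app-http-trigger-url>", json={"errors": summary})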
To catch the scenario where the rows copied and the rows written are not equal, you could use the output of the Copy activity and, if the difference is not 0, send an alert.
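Sketched with a made-up output payload, that check amounts to the comparison below; in the pipeline itself the same comparison would go into an If Condition over the Copy activity's output (rowsRead and rowsCopied are the usual fields of interest):

# Hypothetical Copy activity output values, just to illustrate the comparison.
copy_output = {"rowsRead": 120, "rowsCopied": 118}

if copy_output["rowsRead"] != copy_output["rowsCopied"]:
    print("Row counts differ - raise an alert (e.g. via a Web activity to a Logic App)")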

New item inserted in Azure Table Storage is not immediately available

I have:
- an endpoint in an Azure Function called "INSERT" that inserts a record in Table Storage using a batch operation.
- an endpoint in a different Azure Function called "GET" that gets a record from Table Storage.
If I insert an item and then immediately get that same item, then the item has not appeared yet!
If I delay by one second after saving, then I find the item.
If I delay by 10ms after saving, then I don't find the item.
I see the same symptom when updating an item. I set a date field on the item. If I get the item immediately after updating it, then sometimes the date field is not set yet.
Is this known behavior in Azure Table Storage? I know about ETags as described here but I cannot see how it applies to this issue.
I cannot easily provide a code sample because this is distributed among multiple functions and I think if I did put it in a simpler example, then there would be some mechanism that would see I am calling from the same ip or with the same client and manage to return the recently saved item.
As mentioned in the comments, Azure Table Storage is Strongly Consistent. Data is available to you as soon as it is written to Storage.
This is in contrast with the Cosmos DB Table API, where there are multiple consistency levels and data may not be immediately available to read after it is written, depending on the consistency level set.
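A quick way to convince yourself is to write an entity and read it back immediately, for example with the azure-data-tables package (a minimal sketch; the connection string is a placeholder and the table is assumed to exist):

from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="items")

table.upsert_entity({"PartitionKey": "p1", "RowKey": "r1", "Content": "hello"})

# With strong consistency, the entity is readable as soon as the write returns.
fetched = table.get_entity(partition_key="p1", row_key="r1")
print(fetched["Content"])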
The issue was related to my code and queues running in the background.
I had shut down the Function that has queue triggers, but to my surprise I found that the Function in my staging slot was picking items off the queue. That is why it made a difference whether I delayed for a second or two.
And to the second part, why a date field seemingly is not set as soon as I read the item back: it turns out I had filtered by columns, like this:
var operation = TableOperation.Retrieve<Entity>(partitionKey, id, new List<string> { "Content", "IsDeleted" });
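// Only the "Content" and "IsDeleted" columns are requested; properties outside this list (such as the date field) are left at their default values on the returned entity.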
And to make matters worse, the class "Entity" that I deserialize to of course has default primitive values (such as false), so it wasn't obvious that those properties were simply never being populated.
So the answer does not have much to do with the question. In summary, for anyone finding this question because they are wondering the same thing:
The answer is YES - Table Storage is in fact strongly consistent and it doesn't matter whether you're 'very fast' or connect from another location.

Azure Data Factory prohibit Pipeline Double Run

I know it might be a bit of a confusing title, but I couldn't come up with anything better.
The problem ...
I have an ADF pipeline with 3 activities: first a Copy to a DB, then a stored procedure twice. All are triggered by day and use a WindowEnd to read the right directory or pass a date to the SP.
There is no way I can get an import date into the XML files that we are receiving.
So I'm trying to add it in the first SP.
The problem is that once the first activity of the pipeline is done, the 2 others are started.
The 2nd activity in the same slice is the SP that adds the dates, but when history is being loaded the same pipeline also starts a copy for another slice.
So I'm getting mixed-up data.
As you can see in the 'Last Attempt Start'.
Does anybody have an idea on how to avoid this?
ADF Monitoring
In case somebody hits a similar problem:
I've solved the problem by working with daily named tables.
Each slice puts its data into a staging table with a _YYYYMMDD suffix, which can be set as "tableName": "$$Text.Format('[stg].[filesin_1_{0:yyyyMMdd}]', SliceEnd)".
So now parallelism is never a problem anymore.
The only disadvantage is that the SPs that come after this first step have to work with dynamic SQL, as the table name they select from is variable.
But that wasn't a big coding problem.
Works like a charm!

Azure Data Factory Data Migration

Not really sure whether this is an explicit question or just a request for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have an MS SQL Server DB which has data that is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one on a not so often basis.
What I want to do is prepare a Data Factory pipeline that will grab the data from all 3 DB locations, combine it based on the common attributes, and produce a new dataset. Then, from this dataset, push the data to another SQL Server DB.
I'm a bit unclear on how this is to be done within Data Factory. There is a Copy activity, but it only works on a single input dataset, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to the activity I want to perform.
I did find that there is a special activity called a Custom Activity that is in effect a user-defined definition that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure whether it is the optimal solution.
On top of that, I am also unclear about how the merging of the 3 data sources would work. If joining data from the 3 different sources is required, I don't know how you would do this when the datasets are just snapshots of the originating source data, which leads me to think there is a possibility of missing data. I'm not sure whether publishing some of the data somewhere would be required, but that seems like it would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
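Just to pin down the combine step being asked about (not how Data Factory itself would express it), here is a small sketch of joining the three sources on their common attribute, using pandas with made-up column names:

import pandas as pd

# Made-up stand-ins for the two NoSQL collections and the SQL Server table.
collection_a = pd.DataFrame({"shared_key": [1, 2], "a_value": ["x", "y"]})
collection_b = pd.DataFrame({"shared_key": [1, 2], "b_value": ["p", "q"]})
sql_table = pd.DataFrame({"shared_key": [1, 2], "sql_value": [10, 20]})

# Join all three on the common attribute to produce the new dataset to push onward.
combined = (
    collection_a
    .merge(collection_b, on="shared_key")
    .merge(sql_table, on="shared_key")
)
print(combined)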
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS, but what you are trying to do is fairly common for either of these integration tools (SSIS or ADF).
Your ADF diagram should look something like:
1. You define your 3 data sources as ADF Datasets on top of a corresponding Linked Service.
2. Then you build a pipeline that brings information from SQL Server into a temporary data source (an Azure Table, for example).
3. Next you build 2 pipelines that each take one of your NoSQL Datasets and run a function to update the temporary data source, which is the output.
4. Finally you build a pipeline that brings all your data from the temporary data source into your other SQL Server.
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break the task down into logical jobs and you should have no problem coming up with a solution.
