Custom validation error reporting in Azure Data Factory

I'm using Azure Data Factory to build some file-to-db imports, and one of the requirements I have is that if a file isn't valid, e.g. a column is missing or contains incorrect data (wrong data type, lookup value doesn't exist in the db), then an alert is sent detailing the errors. The errors should be plain human-readable, so rather than a SQL error saying an insert would violate a foreign key, it should say an incorrect value was entered for x.
This doc (https://learn.microsoft.com/en-us/azure/data-factory/how-to-data-flow-error-rows) describes a way of using conditional splits to add custom validation, which would certainly work to let me import the good data and write the bad data to another file with custom error messages. But how can I then trigger an alert from this? As far as I can tell, the data flow will still report success, and doing something like calling a Logic App to send an email needs to be done in the pipeline rather than the data flow.

That's a good point, but couldn't you write the bad records to an error table/file, then build an aggregated summary (how many records erred, counts of specific errors) that gets passed to a Logic App/SendGrid API to alert interested parties of the status? It would be a post-data-flow activity that checks whether there is an error file or error records in the table and, if so, aggregates and classifies them, then alerts.
I have a similar notification process that gives me successful/erred pipeline notifications, as well as 30-day pipeline statistics... % of pipelines successful, average duration, etc.
I’m not at my computer right now, otherwise I’d give more detail with examples.
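As a rough illustration of the approach described above (not from the original answer): a minimal Python sketch that aggregates an error file produced by the conditional split and posts a summary to a Logic App HTTP trigger. The file name, error column and URL are placeholders for whatever your data flow and Logic App actually use.

```python
# Sketch of a post-data-flow alert step: count bad rows per error message and
# post the summary to a Logic App HTTP trigger, which sends the email.
# ERROR_FILE and LOGIC_APP_URL are hypothetical placeholders.
import csv
from collections import Counter

import requests

ERROR_FILE = "bad_rows.csv"                              # output of the conditional split
LOGIC_APP_URL = "https://<logic-app-http-trigger-url>"   # placeholder

def summarize_errors(path):
    """Count rejected rows per human-readable error message."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["error_message"]] += 1
    return counts

def send_alert(counts):
    """Post an aggregated summary to the Logic App."""
    payload = {
        "total_errors": sum(counts.values()),
        "errors_by_type": dict(counts),
    }
    requests.post(LOGIC_APP_URL, json=payload, timeout=30)

if __name__ == "__main__":
    counts = summarize_errors(ERROR_FILE)
    if counts:        # only alert when there actually are bad rows
        send_alert(counts)
```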

To catch the scenario where the rows copied and the rows written are not equal, maybe you can use the output of the copy activity, and if the difference is not 0, send an alert.
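For reference, a minimal sketch of that check in Python, for example as a helper inside an Azure Function called by a Web activity after the copy. The rowsRead/rowsCopied names come from the copy activity's output JSON, and send_alert is a hypothetical stand-in for your Logic App or email call.

```python
# Compare the copy activity's row counters and alert on any mismatch.
def rows_mismatch(copy_output: dict) -> int:
    """Return how many source rows did not make it to the sink."""
    return copy_output.get("rowsRead", 0) - copy_output.get("rowsCopied", 0)

def check_and_alert(copy_output: dict, send_alert) -> None:
    missing = rows_mismatch(copy_output)
    if missing != 0:
        send_alert(f"{missing} row(s) were read but not written by the copy activity")

# Example usage with a fake activity output:
if __name__ == "__main__":
    sample = {"rowsRead": 120, "rowsCopied": 118}
    check_and_alert(sample, print)
```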

Related

BadGateway Power Automate using Execute Script Action

I'm using the Run Script action to transfer data between Excel sheets and I'm facing a BadGateway error saying that the action has timed out. The export action converts the data to JSON, which is then used as input to the import action that transforms the JSON back into table format. The amount of data transferred is large (>10,000 rows); however, even with a smaller dataset (1,000 rows x 3 columns) the error still appears sometimes, which leads me to think it may not be the amount of data but rather that Microsoft's service is under heavy load at the time the flow runs, so it can't fulfil my request.
I would like to know whether any Power Automate plan can help solve this problem, i.e. whether any license gives the user greater capacity or dedicated space so that the connection to the server does not fail when executing the flow and my request is processed flawlessly. Or is it a problem I must solve by decreasing the amount of data transferred in this format, and if so, how can I measure how much data Power Automate can process?

Is there a way to stop Azure ML throwing an error when exporting zero lines of data?

I am currently developing an Azure ML pipeline that, as one of its outputs, maintains a SQL table holding all of the unique items that are fed into it. There is no way to know in advance whether the data fed into the pipeline contains new unique items or repeats of previous items, so before updating the table it maintains, it pulls the data already in that table and drops any of the new items that already appear.
However, because of this self-reference there are cases where zero new items are found, and as such there is nothing to export to the SQL table. When this happens Azure ML throws an error, as having zero lines of data to export is considered an error. In my case, however, this is expected behaviour and absolutely fine.
Is there any way for me to suppress this error, so that when it has zero lines of data to export it just skips the export module and moves on?
It sounds as if you are struggling to orchestrate the data pipeline because the orchestration is happening in two places. My advice would be to either move more of the orchestration into Azure ML, or make the separation between the two greater. One way to do this would be to have a regular export to blob of the table you want to use for training. Then you can use a Logic App to trigger a pipeline whenever a non-empty blob lands in that location.
This issue has been resolved by an update to Azure Machine Learning: you can now run pipelines with a "continue on step failure" flag set, which means that steps following the failed data export will continue to run.
This does mean you will need to design your pipeline so that its downstream modules can handle upstream failures; this must be done very carefully.
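For reference, a minimal sketch of what that looks like with the v1 azureml SDK, where the flag is the continue_on_step_failure argument passed at submission time. The workspace config, experiment name and step list below are placeholders, not from the original question.

```python
# Submit a pipeline so later steps keep running after a failed step.
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()          # reads config.json for your workspace
steps = [...]                         # your existing pipeline steps go here
pipeline = Pipeline(workspace=ws, steps=steps)

experiment = Experiment(ws, "unique-items-export")   # hypothetical name
# continue_on_step_failure lets downstream steps run even if the export step fails,
# so they must tolerate a missing or failed upstream output.
run = experiment.submit(pipeline, continue_on_step_failure=True)
run.wait_for_completion()
```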

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which basically fetches users 1,000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me, as if the node process fails at any time I have to restart it (I currently keep track of the number of users skipped per query, and can restart at the previous skip count found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether an email has been sent to that person or not, but in my current situation I can't do so: I can neither create a new table nor add a new column to the existing table. (Seems to me like an idempotency problem?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who has received an email, so there's no chance of sending it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you will need to store who got the email somewhere, so if you are under hard restrictions against storing this in your main database, get creative with another storage mechanism.
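A minimal Python sketch of that suggestion, under assumptions not in the original post: psycopg2, a separate tracking database holding a single table sent_emails(user_id bigint primary key), and keyset pagination (WHERE id > last ORDER BY id) instead of OFFSET/LIMIT. enqueue_email stands in for the RabbitMQ publish; a crashed run can simply be restarted without sending duplicates.

```python
import psycopg2

BATCH_SIZE = 1000

def enqueue_email(user_id, email):
    """Placeholder for the RabbitMQ publish in the original script."""
    print(f"queueing email for {user_id} <{email}>")

def already_sent(tracking_conn, user_id):
    with tracking_conn.cursor() as cur:
        cur.execute("SELECT 1 FROM sent_emails WHERE user_id = %s", (user_id,))
        return cur.fetchone() is not None

def mark_sent(tracking_conn, user_id):
    with tracking_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO sent_emails (user_id) VALUES (%s) ON CONFLICT DO NOTHING",
            (user_id,),
        )
    tracking_conn.commit()

def run(users_conn, tracking_conn):
    last_id = 0
    while True:
        # Keyset pagination: stable and restartable, unlike OFFSET/LIMIT.
        with users_conn.cursor() as cur:
            cur.execute(
                "SELECT id, email FROM users WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            batch = cur.fetchall()
        if not batch:
            break
        for user_id, email in batch:
            if not already_sent(tracking_conn, user_id):
                enqueue_email(user_id, email)
                mark_sent(tracking_conn, user_id)
        last_id = batch[-1][0]

if __name__ == "__main__":
    users_conn = psycopg2.connect("dbname=app")              # hypothetical connections
    tracking_conn = psycopg2.connect("dbname=email_tracking")
    run(users_conn, tracking_conn)
```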

CRM 365 Workflows and importing data file from Excel

When designing workflows you have a chance to indicate how they are triggered.
In my particular case I am interested in detecting changes to the Status Reason and, for specific states, doing something. I can use the "After" field change on the Status Reason or a Wait condition, and everything looks to be OK.
The question I have is in relation to an Excel export/import used for bulk operations. In this case the user can change (using Excel) the Status Reason field to a value matching the condition in the workflow.
Assuming the workflow is Activated at the time of Excel import, does the workflow get triggered for every row imported?
It might be very inefficient from a timing perspective, but for small data sets it might be beneficial, acting as a bulk update, which is in fact what I am looking for.
Thank you!
For your question,
Yes, the workflow does get triggered every time you import data using Excel, provided the imported rows match the criteria for your workflow.
Workflows run on the server side, which means they will trigger every time the value changes in the database and matches the criteria. You could run your workflow in asynchronous mode, and the CRM async job will take care of allocating resources as and when it has capacity. That way you will not see a performance impact when you import data via Excel.

Azure table storage from Python function consistently failing when looking up for missing entity

I have the following setup:
An Azure Function in Python 3.6 is processing some entities, utilizing the TableService class from the Python Table API (the new one for Cosmos DB) to check whether the currently processed entity is already in a storage table. This is done by invoking TableService.get_entity.
The get_entity method throws an exception each time it does not find an entity with the given row id (same as the entity id) and partition key.
If no entity with this id is found, I call insert_entity to add it to the table.
With this I am trying to implement processing only of entities that haven't been processed before (not logged in the table).
However, I observe consistent behaviour where the function simply freezes after exactly 10 exceptions: it halts execution on the 10th invocation and does not continue processing for another minute or two!
I even changed the implementation so that, instead of doing a lookup first, it simply calls insert_entity and lets it fail when a duplicate row key is added. Surprisingly enough, the behaviour there is exactly the same: on the 10th duplicate invocation the execution freezes and continues after a while.
What is the cause of such behaviour? Is this some sort of protection mechanism on the storage account that kicks in? Or is it something to do with the Python client? To me it looks very much like something by design.
I wasn't able to find any documentation or a settings page on the portal for influencing such behaviour.
I am wondering if it is possible to implement such logic using Table Storage at all. I don't find it justifiable to spin up an Azure SQL DB or Cosmos DB instance for such trivial functionality as checking whether an entity already exists in a table.
Thanks
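For reference, a condensed sketch of the two patterns described in the question, using the azure-cosmosdb-table package's TableService. The account, key and table names are illustrative only; this does not explain the 10-exception stall, it just shows the lookup-first and insert-first variants side by side.

```python
from azure.common import AzureConflictHttpError, AzureMissingResourceHttpError
from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(account_name="<account>", account_key="<key>")
TABLE = "ProcessedEntities"   # hypothetical table name

def is_new_by_lookup(entity_id: str) -> bool:
    """Lookup-first variant: get_entity raises when the row is missing."""
    try:
        table_service.get_entity(TABLE, "processed", entity_id)
        return False          # already logged, skip processing
    except AzureMissingResourceHttpError:
        table_service.insert_entity(
            TABLE, {"PartitionKey": "processed", "RowKey": entity_id}
        )
        return True

def is_new_by_insert(entity_id: str) -> bool:
    """Insert-first variant: a duplicate RowKey raises a conflict error."""
    try:
        table_service.insert_entity(
            TABLE, {"PartitionKey": "processed", "RowKey": entity_id}
        )
        return True
    except AzureConflictHttpError:
        return False          # a previous run already logged this entity
```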
