How to validate incoming files in Azure data factory


Is there a way to create a template to validate incoming files, including checks such as empty-file checks, format, data types, and record counts, that will stop the workflow if any of the checks fail? The solution should handle multiple file formats and reduce the burden on ETL processing and checks so that it can scale.
File transfer will occur either by trigger or by data currency rule.

Data Factory focuses more on data transfer than on file filtering.
We can use the Get Metadata and If Condition activities to achieve some of these features, such as validating the file format, size, and file name. Get Metadata returns the file properties, and If Condition lets you filter files based on them.
But achieving all the features you want would be too complex in Data Factory.
Update:
For example, we can parameterize the source file:
Create a dataset parameter filename and a pipeline parameter name:
Use Get Metadata to get its properties: Item type, Exists, Size, Item name.
Output:
For example, we can build an expression in the If Condition to check whether the file is empty (size = 0):
@equals(activity('Get Metadata1').output.size,0)
True means the file is empty; False means it is not. We can then build the rest of the workflow in the True or False activities.
Hope this helps.
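The Get Metadata and If Condition steps above can be sketched as pipeline JSON. The sketch below builds that JSON as a Python dictionary so it can be printed and inspected. The dataset name SourceFile, the pipeline name, and the activity names are hypothetical placeholders, and using a Fail activity in the True branch to stop the run is an assumption about how you might end the workflow, not something the answer above prescribes.

```python
import json

# Hypothetical names: "SourceFile", "ValidateIncomingFile" and the activity
# names are placeholders chosen for this sketch.
get_metadata = {
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "SourceFile",
            "type": "DatasetReference",
            # Pass the pipeline parameter "name" into the dataset parameter "filename".
            "parameters": {
                "filename": {
                    "value": "@pipeline().parameters.name",
                    "type": "Expression",
                }
            },
        },
        # Properties listed in the answer: Item type, Exists, Size, Item name.
        "fieldList": ["itemType", "exists", "size", "itemName"],
    },
}

if_empty = {
    "name": "If file is empty",
    "type": "IfCondition",
    "dependsOn": [
        {"activity": "Get Metadata1", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "expression": {
            "value": "@equals(activity('Get Metadata1').output.size,0)",
            "type": "Expression",
        },
        # True branch (assumption): stop the run when the file is empty.
        "ifTrueActivities": [
            {
                "name": "Fail on empty file",
                "type": "Fail",
                "typeProperties": {
                    "message": "Incoming file is empty",
                    "errorCode": "EmptyFile",
                },
            }
        ],
        # False branch: downstream processing activities would go here.
        "ifFalseActivities": [],
    },
}

pipeline = {
    "name": "ValidateIncomingFile",
    "properties": {
        "parameters": {"name": {"type": "String"}},
        "activities": [get_metadata, if_empty],
    },
}

print(json.dumps(pipeline, indent=2))
```

Further checks (file name pattern, minimum size) would follow the same shape: one more expression over the Get Metadata output, evaluated in its own If Condition.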

I demonstrate similar techniques to validate source files and take appropriate downstream actions in your pipeline based on those values in this video.

Related

How to insert an item in CosmosDB (SQL API) using an Azure Data Factory activity

I have an ADF pipeline which iterates over a set of files, performing various operations, and I have an Azure CosmosDB (SQL API) instance where I would like to insert the name of each file and a timestamp, mainly to keep track of which files have already been processed and which have not. In the future I might want to add some other bits of data related to each file.
What I have is my CosmosDB
Currently I am trying to use the Copy Data activity for the insert part.
One problem that I have is that this particular activity expects source while at this point I have only the filename. In theory it was an option to use the Blob Storage from where I read the file at the beginning, but since the Blob Storage is set to store binary files I got the following error if I try to use it as source
Because of that I created a dummy CosmosDB Linked service, but I have several issues with this approach:
Generally the idea for dummy source is not very appealing to me
I haven't found a lot of information on the topic, but it seems that if I want to use something in the Sink I need to SELECT it from the source
Even though I have selected a value for the id the item is not saved with the selected value from the Source query, but as you can see from the first screenshot I got a GUID and only the name is as I want it.
So I have two questions. I am just learning ADF, but this approach doesn't look like the proper way to insert an item into CosmosDB from an activity, so a better or more common approach would be appreciated. If there is no better proposal, how can I at least apply my own value for the id column? If I create the item in the CosmosDB GUI and save it from there, as you can see I am able to use the filename as the id, which for now seems like a good idea to me. But I wasn't able to set a custom value (string or int) when I tried through the activity, so how can I achieve this?
This is what my Sink looks like

Create Data Factory Dataset in a specific Azure DevOps branch rather than directly in Data Factory

While trying to build an ADF pipeline that generates datasets within Data Factory, I ran into an interesting issue. Or maybe I misunderstand some components completely, in which case I'd happily be educated.
I basically read some meta data from a SQL Database table which determines which source system, schema and tables I should pull new data from. The meta data is stored within a bunch of variables, which then feed a Web Request that attempts to generate a new Data Source as per the MS documentation. Yes, I'm trying to use Azure Data Factory to generate Azure Data Factory components.
The URL to create the DataSet and the JSON Body for the request are both generated using @concat and a number of the variables. The resulting DataSet is a very straightforward file that does not contain references to the columns, but just the table schema and table name. I generated these manually before, and that all seems to work brilliantly. I basically have a dataset connected to the source system, referencing the table from the meta data.
The code runs, but the resulting dataset is directly published, as opposed to being added in my working branch. While this should not be a big issue once I manage to properly test everything, ideally the object would be created in my working branch (using Azure DevOps, thus a local file).
My next thought was to set up a linked service to my local PC, and simply write the same contents as above there. My challenge seems to be that I essentially am creating a file out of nothing. I am trying to use a Copy Data component, and added an empty placeholder file to act as a source.
I configure the sink with Dynamic Content for Copy Behavior, and attempted to add the JSON contents there. This gets the file created, but it's unfortunately empty. I also attempted to add a new column to the source with the data being the same contents.
However, since the file to be used as a sink doesn't exist, a mapping error occurs. Apart from this, I don't want a column header to be written, just the dynamically created contents.
I'm not sure how to continue with this. I feel I'm very close to achieving my goal, but cannot seem to take this final hurdle.
Any hints or suggestions would be very welcome.

Azure DataFactory Params - Newbie Question

I'm working with ADF and trying to leverage parameters to make life easier and reduce the number of objects being created in the ADF itself. What I am trying to do would appear on the surface to be extremely simple, but in reality it's driving me slowly crazy. Would greatly appreciate any assistance!
I am trying to set up a parameterised dataset to be used as a sink target. Inside that dataset I have added a param named "filenames" of type string. In the connection tab I have added that param to the file part of the path. The folder part points to my Azure Data Lake folder and the file part is set to: @dataset().filename which is the result of choosing 'dynamic content' then selecting the param.
So far so good.. my sink target is, as far as I am aware, ready to receive "filenames" to write out to.
This is where it all goes wrong.
I now create a new pipeline. I want to use a list or array of values inside that pipeline which represent the names of the files I want to process. I have been told that I'll need a Foreach to send each of the values one at a time to the COPY DATA task behind the Foreach. I am no stranger to Foreach type loops and behaviors.. but for the life of me I CANNOT see where to set up the list of filenames. I can create a param as a type "array" but how the heck do you populate it?
I have another use case which this problem is preventing me from completing. This use case is, I think, the same problem but perhaps serves to explain the situation more clearly. It goes like this:
I have a linked service to a remote database. I need to copy data from that database (around 12 tables) into the data lake. At the moment I have about 12 "COPY DATA" actions linked together - which is ridiculous. I want to use a Foreach loop to copy the data from source to data lake one after the other. Again, I can set up the sink dataset to be parameterised, just fine... but how the heck do I create the array/list of table names in the pipeline to pass to the sink dataset?
I add the Foreach and inside the foreach a "COPY DATA" but where do I add all the table names?
Would be very grateful for any assistance. THANK YOU.

If you want to manually populate values of an array as a pipeline parameter, you create the parameter with Array type and set the value with syntax like: ["File1","File2","File3"]
You then iterate that array using a ForEach activity.
Inside the ForEach, you reference @item() to get the current file name value the loop is on.
You can also use a Lookup activity to get data from elsewhere and iterate over that using the ForEach.
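As a rough sketch of the array parameter and ForEach described above, the pipeline JSON could look like the dictionary below, built in Python so it can be printed. The names CopyEachFile, fileNames, SourceDataset and SinkDataset are hypothetical, and the DelimitedText source and sink types are assumptions; substitute whatever matches your own datasets.

```python
import json

pipeline = {
    "name": "CopyEachFile",  # hypothetical pipeline name
    "properties": {
        "parameters": {
            # Array parameter populated manually, as in the answer above.
            "fileNames": {
                "type": "Array",
                "defaultValue": ["File1", "File2", "File3"],
            }
        },
        "activities": [
            {
                "name": "ForEachFile",
                "type": "ForEach",
                "typeProperties": {
                    # Iterate over the array parameter.
                    "items": {
                        "value": "@pipeline().parameters.fileNames",
                        "type": "Expression",
                    },
                    "activities": [
                        {
                            "name": "CopyOneFile",
                            "type": "Copy",
                            "inputs": [
                                {
                                    "referenceName": "SourceDataset",  # hypothetical
                                    "type": "DatasetReference",
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "SinkDataset",  # hypothetical
                                    "type": "DatasetReference",
                                    # @item() is the current array element.
                                    "parameters": {
                                        "filename": {
                                            "value": "@item()",
                                            "type": "Expression",
                                        }
                                    },
                                }
                            ],
                            # Assumed source and sink types for this sketch.
                            "typeProperties": {
                                "source": {"type": "DelimitedTextSource"},
                                "sink": {"type": "DelimitedTextSink"},
                            },
                        }
                    ],
                },
            }
        ],
    },
}

print(json.dumps(pipeline, indent=2))
```

The same shape works for a list of table names: put the table names in the array and pass @item() to a table-name parameter on the source dataset as well.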

Unable to map lookup activity output to Copy Activity Mapping in ADF

I am new to the ADF.
I am trying to use a Copy activity to move data from an API call output to a Blob JSON file, but I am unable to use the Lookup output. I am trying to map the fields explicitly in Mapping using @item().SiteID, but the JSON output contains only the input fields (not the derived fields). Can someone tell me how to achieve this?
Can I use a Copy activity inside a ForEach activity (@activity('LookupAvailableChannelListForExport').output.value) to pass the Lookup output value (@item().siteID) into the mapping between source and sink?

As far as I know, the output of a Lookup activity can't be used as source data in a Copy activity, nor in the mapping between source and sink. Actually, the Lookup activity is intended for the following usage, according to the official documentation:
Dynamically determine which objects to operate on in a subsequent
activity, instead of hard coding the object name. Some object examples
are files and tables.
I think the example at the link above is a good illustration. You can see that the output of the Lookup activity is used to set the table name of a dynamic SQL DB source dataset, not as the source data itself.
Back to your requirement: I think you could configure the source dataset to point at the root folder, if the files are stored in the same directory with the same schema. Keep this option selected so that the data in all the files is picked up.
If you want to transform the source data, the Copy activity can't do it, but a Data Flow activity can. You could use a Derived Column transformation, for example to reshape the JSON structure.

Azure Data Factory dynamic output path based on source dataset payload

I have a stream analytics job which constantly dumps data in Cosmos DB. The payload has a property "Type" which determines the payload itself. i.e. which columns are included in the payload. It is an integer value of either 1 or 2.
I'm using Azure Data Factory V2 to copy data from Cosmos DB to Data Lake. I've created a pipeline with an activity that does this job. I'm setting the output path folder name using:
@concat('datafactoryingress/rawdata/',dataset().productFilter,'/',formatDateTime(utcnow(),'yyyy'),'/')
What I want in the datafactory is to identify the payload itself, i.e. determine if the type is 1 or 2 and then determine if the data goes in folder 1 or folder 2. I want to iterate the data from Cosmos DB and determine the message type and segregate based on message Type and set the folder paths dynamically.
Is there a way to do that? Can I check the Cosmos DB document to find out the message type and then how do I set the folder path dynamically based on that?

Is there a way to do that? Can I check the Cosmos DB document to find
out the message type and then how do I set the folder path dynamically
based on that?
Unfortunately, based on the documentation, dynamic content from the source dataset is not supported by ADF so far. You can't use fields from the source data as dynamic parameters for the sink output. In your situation, I suggest setting up two separate pipelines, each transferring the data for one value of the Type field.
If the Type field can take many values and you do want to differentiate the output path, ADF may not be a suitable choice for you. You could write custom code to fulfill your needs.
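As one possible shape for that custom code, here is a minimal Python sketch that groups documents by their Type field and derives an output folder for each group. It only shows the routing logic: reading from Cosmos DB and writing to Data Lake are left out, the function names are hypothetical, and putting Type in the folder position that the question fills with dataset().productFilter is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone


def folder_for(doc, now=None):
    """Build the output folder for one document from its Type and the current year."""
    now = now or datetime.now(timezone.utc)
    return f"datafactoryingress/rawdata/{doc['Type']}/{now:%Y}/"


def split_by_type(docs, now=None):
    """Group documents by the folder their Type maps to."""
    groups = defaultdict(list)
    for doc in docs:
        groups[folder_for(doc, now)].append(doc)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample documents for illustration only.
    sample = [
        {"id": "a", "Type": 1},
        {"id": "b", "Type": 2},
        {"id": "c", "Type": 1},
    ]
    for folder, items in split_by_type(sample).items():
        print(folder, [d["id"] for d in items])
```

Each group could then be written to its folder by whatever client you use for the Data Lake, which keeps the routing decision separate from the storage calls.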
