I am creating a pipeline using ADF to copy the data in an XML file to a SQL database. I want this pipeline to be triggered when the XML file is uploaded to Blob Storage. Therefore, I will be using a parameter with the input Dataset.
Now, in the Copy Data activity that I am using, I want to be able to define the mappings. This is usually quite easy when the path to the file is given; however, in this situation, where a parameter is being used, how can I do this?
From what I have gathered, the mappings can be defined as a JSON schema and assigned to the activity, but is there perhaps an easier way to do this? Maybe by uploading a demo file from which the schema can be imported?
When you want to load an XML file into a SQL database, you are copying from a hierarchical source to a tabular sink.
When copying data from a hierarchical source to a tabular sink, the copy activity supports the following capabilities:
Extract data from objects and arrays.
Cross apply multiple objects with the same pattern from an array, converting one object into multiple records in the tabular result.
You can define such a mapping on the Data Factory authoring UI (a JSON sketch of the resulting mapping follows these steps):
On the copy activity -> Mapping tab, click the Import schemas button to import both the source and sink schemas. Because Data Factory samples only the top few objects when importing the schema, any field that doesn't show up can be added to the correct layer in the hierarchy: hover over an existing field name and choose to add a node, an object, or an array.
Select the array from which you want to iterate and extract data; it will be auto-populated as the Collection reference. Note that only a single array is supported for this operation.
Map the needed fields to the sink. Data Factory automatically determines the corresponding JSON paths for the hierarchical side.
Note: For records where the array marked as collection reference is empty and the check box is selected, the entire record is skipped.
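For reference, the mapping that the UI produces is stored in the copy activity JSON as a TabularTranslator under the translator property. A minimal sketch of what it could look like for an XML source (the element names, paths, and column names here are illustrative, not taken from your file):

{
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "$['Orders']['OrderDate']" }, "sink": { "name": "OrderDate", "type": "DateTime" } },
        { "source": { "path": "['ProductName']" }, "sink": { "name": "ProductName", "type": "String" } },
        { "source": { "path": "['Quantity']" }, "sink": { "name": "Quantity", "type": "Int32" } }
    ],
    "collectionReference": "$['Orders']['Item']"
}

The $-prefixed path is absolute from the document root, while the relative paths (['ProductName'], ['Quantity']) are resolved against each element of the array named as the Collection reference.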
Here I am using a sample XML file as the source.
Notice that I have used a dataset parameter, to which I will assign the file name value obtained from the trigger. I have placed it in the file name field of the file path property in the dataset connection.
Next, I created a pipeline parameter to hold the input obtained from the trigger before assigning it to the dataset parameter.
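To make the parameterized part concrete, here is a minimal sketch of what such a source dataset could look like in JSON, assuming an XML dataset on Azure Blob Storage (the container, linked service, and parameter names are illustrative):

{
    "name": "SourceXmlDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLS",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "fileName": { "type": "string" }
        },
        "type": "Xml",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": {
                    "value": "@dataset().fileName",
                    "type": "Expression"
                }
            }
        }
    }
}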
Create storage event trigger
Click Continue and you will find a preview of all the files that match the trigger conditions.
On the next page, any pipeline parameters you have created (as we have here) will be listed.
Fill in the value as needed (as sketched below); the available system variables are listed under Storage event trigger scope.
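This is roughly what the trigger JSON could look like once the storage event's system variable is passed into the pipeline parameter (the pipeline, parameter, and path values are illustrative placeholders):

{
    "name": "BlobCreatedTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/input/blobs/",
            "blobPathEndsWith": ".xml",
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "scope": "/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyXmlToSql",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "sourceFileName": "@triggerBody().fileName"
                }
            }
        ]
    }
}

@triggerBody().folderPath is available in the same way if you also need the folder.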
Now, let's move to the copy data activity. Here you will find the dataset parameter; assign the pipeline parameter value to it.
Now move to the Sink tab in the copy activity. Since you want the sink to follow the source schema, the easiest way is to select Auto create table.
For this, you have to make the appropriate changes in the sink dataset. To configure the sink dataset, choose Edit for the table and manually enter a name that does not already exist on your server, i.e. a new table with this name will be created in the SQL server referenced by the sink. Make sure you clear any schema there, since the source schema will be applied in the copy activity.
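In JSON terms, the Auto create table option corresponds to the sink's tableOption setting in the copy activity, roughly like this (a sketch, assuming an XML source and an Azure SQL sink):

{
    "source": { "type": "XmlSource" },
    "sink": {
        "type": "AzureSqlSink",
        "tableOption": "autoCreate"
    }
}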
Back in the Mapping tab of the copy activity, click Import schema and select the fields you want to copy to the table. You can also specify the data types; setting the Collection reference is necessary.
Refer: Parameterize mapping
You can also switch to the Advanced editor, in which case you can directly see and edit the fields' JSON paths. If you choose to add a new mapping in this view, specify the JSON path.
So when a file is created in the storage account, a Blob Created event fires and the pipeline runs.
You can see the new table "dbo.NewTable" created under ktestsql, and it contains the data from the XML file as rows.
Related
I am using Azure Data Factory to transfer data from a SOAP API connection to Snowflake. I understand that Snowflake has to have the data in a variant column or CSV, or we need intermediate storage in Azure to finally land the data in Snowflake. The problem I faced is that the data from the API is a string, and within that string there is XML data. So when I put the data in Blob storage, it is a string. How do I avoid this and get the proper columns while loading the data?
Over here, the column is read as a string. Is there a way to parse it into its respective rows? I tried to set the collection reference, but it still does not recognize individual columns. Any input is highly appreciated.
You need to switch to the Advanced editor in the Mapping section of the copy activity. I took the sample data and reproduced this; below are the steps.
Img:1 Source dataset preview
In the Mapping section of the copy activity:
Click Import schema.
Switch to the Advanced editor.
Give the collection reference value.
Img:2 Mapping settings
I am new to Azure Data Factory, and I currently have the following setup for a pipeline.
Azure Data Factory Pipeline
Inside the for each
The pipeline does the following:
Reads files from a directory every day
Filters the children in the directory based on file type [only selects TSV files]
Iterates over each file and copies the data to Azure Data Explorer if they have the correct schema, which I have defined in mapping for the copy activity.
The copied files are then moved to a different directory and deleted from the original directory so that they aren't copied again.
[Question]: I want to delete or skip the rows which have a null value in any one of the attributes.
I was looking into using data flow, but I am not sure how to use data flows to read multiple tsv files and validate their schema before applying transformations to delete the null records.
Please let me know if there is a solution where I can skip the null values in the for each loop or if I can use data flow to do the same.
If I can use data flow, how do I read multiple files and validate their column names (schema) before applying row transformations?
Any suggestions that would help me delete or skip those null values would be hugely appreciated.
Thanks!
OK, inside the ForEach activity, you only need to add a Data Flow activity.
The main idea is to apply a filter/assert transformation and then write to multiple sinks.
ADF dataflow:
Source:
Add your TSV file as the source, and make sure to select After completion -> Delete source files; this will save you from adding a separate delete activity.
Filter activity:
Now, it depends on your use case: do you want to filter out rows with null values, or do you want to validate that you don't have null values?
If you want to filter, just add a filter transformation; in the filter settings -> Filter on -> add your condition.
If you need to validate rows and make the data flow fail, use the assert transformation.
Filter condition: !isNull(columnName) (this keeps only the rows where the column is not null)
Sink:
I added 2 sinks, one for Azure Data Explorer and one for the new directory.
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-assert
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-filter
https://microsoft-bitools.blogspot.com/2019/05/azure-incremental-load-using-adf-data.html
Please consider the incremental load and change the data flow accordingly.
I want to copy data from a CSV file (source) on Blob storage to an Azure SQL Database table (sink) via a regular Copy activity, but I also want to copy the file name alongside every entry into the table. I am new to ADF, so the solution is probably easy, but I have not been able to find the answer in the documentation or on the internet so far.
My mapping currently looks like this (I have created an output table with the file name column, but this data is not explicitly defined at the column level in the CSV file, so I need to extract it from the metadata and pair it with the column):
At first, I thought I would put dynamic content in there and solve the problem that way, but there is no option to use dynamic content in each individual box, so I do not know how to implement the solution. My next thought was to use a pre-copy script, but I have not seen how I could use it for this purpose. What is the best way to solve this issue?
In the mapping columns of the copy activity, you cannot add dynamic content from the metadata.
First give the source CSV dataset to the Get Metadata activity, then chain it to the copy activity as shown below.
You can add the file name column through Additional columns in the copy activity source itself, by supplying the dynamic content from the Get Metadata activity (which uses the same source CSV dataset).
@activity('Get Metadata1').output.itemName
If you are sure about the data types of your data, there is no need to go to the mapping; you can execute your pipeline.
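Putting it together, the copy activity source would look roughly like this (a sketch; the activity name matches the Get Metadata activity above, while the added column name FileName is just an example):

{
    "source": {
        "type": "DelimitedTextSource",
        "additionalColumns": [
            {
                "name": "FileName",
                "value": {
                    "value": "@activity('Get Metadata1').output.itemName",
                    "type": "Expression"
                }
            }
        ]
    },
    "sink": { "type": "AzureSqlSink" }
}

As a side note, for file-based sources the copy activity also accepts the reserved value $$FILEPATH in Additional columns, which records the source file path without needing a Get Metadata activity at all.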
Here I am copying the contents of the samplecsv.csv file to a SQL table named output.
My output for your reference:
Is there any way we can fetch the max of the last modified date from the last processed file and store it in a config table?
From Supported data stores and formats you can see that Salesforce, Salesforce Service Cloud, and Marketing Cloud are supported.
You have to perform the following steps:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Follow this to set up a linked service for Salesforce in Azure Data Factory.
When copying data from Salesforce, you can use either a SOQL query or a SQL query. Note that these two have different syntax and functionality support; do not mix them. You are advised to use the SOQL query, which is natively supported by Salesforce.
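For instance, a copy activity source with a SOQL query could look like this (a sketch; the object and field names are illustrative):

{
    "source": {
        "type": "SalesforceSource",
        "query": "SELECT Id, Name, LastModifiedDate FROM Account WHERE LastModifiedDate > 2023-01-01T00:00:00Z"
    }
}

Note that in SOQL, datetime literals are written without quotes, which is one of the syntax differences from SQL.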
Process for incremental loading, or delta loading, of data through a watermark:
In this case, you define a watermark in your source database. A watermark is a column that has the last updated time stamp or an incrementing key. The delta loading solution loads the changed data between an old watermark and a new watermark. The workflow for this approach is depicted in the following diagram:
ADF will scan all the files in the source store, apply the file filter based on their LastModifiedDate, and copy only the files that are new or updated since the last run to the destination store.
For capabilities, prerequisites, and Salesforce request limits, refer to Copy data from and to Salesforce by using Azure Data Factory.
Refer to the doc: Delta copy from a database with a control table. This article describes a template that's available to incrementally load new or updated rows from a database table to Azure by using an external control table that stores a high-watermark value.
This template requires that the schema of the source database contains a timestamp column or incrementing key to identify new or updated rows.
The template contains four activities:
Lookup retrieves the old high-watermark value, which is stored in an external control table.
Another Lookup activity retrieves the current high-watermark value from the source database.
Copy copies only the changes from the source database to the destination store. The query that identifies the changes in the source database is similar to 'SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column > "last high-watermark" AND TIMESTAMP_Column <= "current high-watermark"' (see the sketch after this list).
StoredProcedure writes the current high-watermark value to an external control table for delta copy next time.
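A hedged sketch of how the Copy activity's source query could be wired to the two Lookup activities; the activity, table, and column names below are illustrative, not taken from the template:

{
    "source": {
        "type": "AzureSqlSource",
        "sqlReaderQuery": {
            "value": "SELECT * FROM data_source_table WHERE LastModifytime > '@{activity('LookupOldWaterMark').output.firstRow.WatermarkValue}' AND LastModifytime <= '@{activity('LookupNewWaterMark').output.firstRow.NewWatermarkvalue}'",
            "type": "Expression"
        }
    }
}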
Go to the Delta copy from Database template. Create a new connection to the source database that you want to copy data from.
Create connections to the external control table and the stored procedure that you created, and select Use this template.
Choose the available pipeline
For Stored procedure name, choose [dbo].[update_watermark]. Select Import parameter, and then select Add dynamic content.
Click Add dynamic content and type in the query below. This will get the maximum date in your watermark column, which we can use for the delta slice.
You can use this query to fetch the max of the last modified date from the last processed file:
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table
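For completeness, here is a sketch of the stored procedure activity that writes the new watermark back to the control table (the activity, linked service, and parameter names are illustrative):

{
    "name": "UpdateWatermark",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabaseLS",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "[dbo].[update_watermark]",
        "storedProcedureParameters": {
            "LastModifiedtime": {
                "value": {
                    "value": "@{activity('LookupNewWaterMark').output.firstRow.NewWatermarkvalue}",
                    "type": "Expression"
                },
                "type": "DateTime"
            }
        }
    }
}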
or
For files only you can use Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool
Refer: Source: ADF Incremental loading with configuration stored in a table
I'm trying to use Azure Data Factory to take CSVs and turn them into SQL tables in the DW.
The columns will change often, so it needs to pick up the CSV's schema dynamically.
I've tried using get metadata to get the structure and data type, but I'm unable to parse it into the relevant format to create the sql table.
Has anyone done anything like this in ADF? Is it possible?
Yes - it takes a bit of configuration, but you can accomplish this with Azure Data Factory Data Flow (ADFDF).
Create a DataSet pointing to your CSV location (I'm assuming Azure Blob Storage).
Initially, select a specific CSV file.
On the Schema tab, click "Import schema". It is OK that this will change later, but the DataSet must have a schema at design time.
On the Parameters tab, create a parameter for the blobName.
On the Connection tab, reference that parameter in the "File" box. You will set its value in the pipeline at runtime. [This overrides the initial value used to define the schema].
Create a DataSet for the SQLDW table.
Select "Create new table"
Add the schema and table names [this should be configurable/overrideable later via DataSet parameters if needed]
The Schema tab will show no Schema.
Create a DataFlow to move the data from CSV to SQLDW.
SOURCE: select the DataSet created in step 1.
On the Source Settings tab: Make sure "Allow schema drift" is checked and "Validate schema" is unchecked [These are the default settings].
CHECK "Infer drifted column types", which is NOT the default.
SINK: select the DataSet created in step 2.
On the Sink tab: Make sure "Allow schema drift" is checked and "Validate schema" is unchecked [These are the default settings].
On the Settings tab, change "Table action" to "Recreate table". This should infer the new schema and drop and create the columns based on what it finds.
On the Mappings tab: make sure "Auto Mapping" is enabled [should be by default]
In the Pipeline:
Create a parameter for "blobName"
Select the Data Flow activity:
On the Settings tab: set the source parameter for blobName to the pipeline parameter you just created (see the sketch after these steps).
SQLDW specific: you will need to provide a Blob Storage Linked Service and location for Polybase.
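As a rough sketch of how the pipeline's blobName parameter can be passed into the Data Flow activity's source dataset parameter (the data flow, source, and parameter names are illustrative, and the exact property layout may vary slightly between ADF versions):

{
    "name": "LoadCsvToSqlDw",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {
            "referenceName": "CsvToSqlDw",
            "type": "DataFlowReference",
            "datasetParameters": {
                "CsvSource": {
                    "blobName": "@pipeline().parameters.blobName"
                }
            }
        }
    }
}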
CAVEATS
From what I've seen, every column in the SQLDW table is created as NVARCHAR(MAX). I thought the "Infer drifted column types" would address this, but apparently not.
This configuration assumes that the first row of the CSV is a header row.
The Sink operation will fail if the incoming column names in the header row contain spaces or special characters. To combat this in a production scenario, you should add a SELECT in between the Source and Sink activities in the Data Flow, then use the new Rule-based mapping and expressions to strip out any invalid characters.
My example uses the same SQLDW schema and table name every time, but as mentioned in step 2 above, you should be able to create DataSet parameters to override those at runtime if needed.