Skip null rows while reading Azure Data Factory

Skip null rows while reading Azure Data Factory - azure

I am new to Azure Data Factory, and I currently have the following setup for a pipeline.
Azure Data Factory Pipeline
Inside the for each
The pipeline does the following:
Reads files for a directory everyday
Filters the children in the directory based on file type [only selects TSV files]
Iterates over each file and copies the data to Azure Data Explorer if they have the correct schema, which I have defined in mapping for the copy activity.
It copied files are then moved to a different directory and deleted from the original directory so that they aren't copied again.
[Question]: I want to delete or skip the rows which have null value in any one of the attributes.
I was looking into using data flow, but I am not sure how to use data flows to read multiple tsv files and validate their schema before applying transformations to delete the null records.
Please let me know if there is a solution where I can skip the null values in the for each loop or if I can use data flow to do the same.
If I can use data flow, how do I read multiple files and validate their column names (schema) before applying row transformations?
Any suggestions that would help me delete or skip those null values will be hugely helpful
Thanks!

Ok, inside the ForEach activity, you only need to add a dataflow activity.
The main idea is to do the filter/assert activity then you write to multiple sinks.
ADF dataflow :
Source:
add your tsv file as requested, and make sure to select in After completion ->Delete source files this will save you from adding a delete activity.
Filter activity:
Now, depends on your use case, do you want to filter rows with null values? or do you want to validate that you don't have null values.
if you want to filter, just add a filter activity, in filter settings -> filter on -> 'here add your condition'.
if you need to validate rows and make the dataflow fail, use the assert activity
filter condition : false(isNull(columnName))
Sink:
i added 2 sinks,one for ADE and one for new directory.
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-assert
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-filter
https://microsoft-bitools.blogspot.com/2019/05/azure-incremental-load-using-adf-data.html
please consider the incremental load and change the dataflow accordingly.

Related

How to delete records from a sql database using azure data factory

I am setting up a pipeline in data factory where the first part of the pipeline needs some pre-processing cleaning. I currently have a script set up to query these rows that need to be deleted, and export these results into a csv.
What I am looking for is essentially the opposite of an upsert copy activity. I would like the procedure to delete the rows in my table based on a matching row.
Apologies in advanced if this is an easy solution, I am fairly new to data factory and just need help looking in the right direction.

Assuming the source from which you are initially getting the rows is different from the sink
There are multiple ways to achieve it.
in case if the number of rows is less, we can leverage script activity or lookup activity to delete the records from the destination table
in case of larger dataset, limitations of lookup activity, you can copy the data into a staging table with in destination and leverage a script activity to delete the matching rows
in case if your org supports usage of dataflows, you can use that to achieve it

One source to many sinks azure synapse pipeline?

I’m using a copy activity in azure synapse pipeline to copy and filter data from
containerA/file1.csv to containerB/file2US.csv
Similarly I’m using another copy activity to copy and filter data from containerA/file1.csv to containerB/file2IND.csv
The same process for different regions. In every activity I add a where clause to filter the data and copy it into region specific files.
It feels pretty redundant to do this way. Is there any way where I can conditionally check each row and copy it to a different sink based on the region value?
What I’m trying to achieve is a SINGLE ACTIVITY that can select the correct sink based on a condition each row maps to.

The activity you are looking for is called Data Flows. You will use the Conditional Split transformation with as many sinks as you require to achieve this use case.

I would approach this using a For Each activity which runs in parallel and a parameterised Copy activity. You can use an array parameter to list the regions you want to loop through. Here's an example with continents:
["Africa","Antarctica","Asia","Australia","Europe","North America","South America"]
Set up your pipeline like this:
Use the Query in the Sink and and parameterise it with the Add dynamic content button:
Alternately use a Stored Proc. Parameterise the Sink using a dataset parameter. This will give you control of the output filename and location.

Add file name to Copy activity in Azure Data Factory

I want to copy data from a CSV file (Source) on Blob storage to Azure SQL Database table (Sink) via regular Copy activity but I want to copy also file name alongside every entry into the table. I am new to ADF so the solution is probably easy but I have not been able to find the answer in the documentation and neither on the internet so far.
My mapping currently looks like this (I have created a table for output with the file name column but this data is not explicitly defined at the column level at the CSV file therefore I need to extract it from the metadata and pair it to the column):
For the first time, I thought that I am going to put dynamic content in there and therefore solve the problem this way. But there is not an option to use dynamic content in each individual box so I do not know how to implement the solution. My next thought was to use Pre-copy script but have not seen how could I use it for this purpose. What is the best way to solve this issue?

In Mapping columns of copy activity you cannot add the dynamic content of Meta data.
First give the source csv dataset to the Get Metadata activity then join it with copy activity like below.
You can add the file name column by the Additional columns in the copy activity source itself by giving the dynamic content of the Get Meta data Actvity after giving same source csv dataset.
#activity('Get Metadata1').output.itemName
If you are sure about the data types of your data then no need to go to the mapping, you can execute your pipeline.
Here I am copying the contents of samplecsv.csv file to SQL table named output.
My output for your reference:

Azure Data Factory Copy Data Activity Mapping in Using Triggers

I am creating a pipeline using ADF to copy the data in a XML file to a SQL database. I want this pipeline to be triggered when the XML file is uploaded to Blob Storage. Therefore, here I will be using a parameter with the input Dataset.
Now, in the Copy Data activity that I am using, I want to be able to define the mappings. This is usually quite easy when the path to the file is given, however, in this situation, where a parameter is being used, how can I do this?
From what I have gathered, the mappings can be defined as a JSON schema and assigned to the activity, but is there perhaps an easier way to do this? Maybe by uploading a demo file from which the schema can be imported?

When you want to load a xml file into sql DB you are using a Hierarchical source to tabular sink method.
When copying data from hierarchical source to tabular sink, copy activity supports the following capabilities:
Extract data from objects and arrays.
Cross apply multiple objects with the same pattern from an array, in which case to convert one JSON object into multiple records in tabular result.
You can define such mapping on Data Factory authoring UI:
On copy activity -> mapping tab, click Import schemas button to import both source and sink schemas. As Data Factory samples the top few objects when importing schema, if any field doesn't show up, you can add it to the correct layer in the hierarchy - hover on an existing field name and choose to add a node, an object, or an array.
Select the array from which you want to iterate and extract data. It will be auto populated as Collection reference. Note only single array is supported for such operation.
Map the needed fields to sink. Data Factory automatically determines the corresponding JSON paths for the hierarchical side.
Note: For records where the array marked as collection reference is empty and the check box is selected, the entire record is skipped.
Here I am using a sample XML file at source
If you notice here I have used a dataset parameter to which I will be assigning the file name value as obtained from trigger. And now I have placed it in the file name field for file path property in dataset connection.
Next I have created a pipeline parameter to hold the input obtained from trigger before assigning it to the dataset parameter.
Create storage event trigger
Click continue and you fill find a preview of all the files that are applicable for trigger conditions
When you have moved to next slide, if you have created pipeline parameter, which we have, you will see them there
Fill in the value as per your need. See the available system variables here Storage event trigger scope
Now, lets move to copy data activity, here you will find the dataset parameter, assign the pipeline parameter value to it.
Now move to sink tab in copy activity, since you want the source schema to be followed into sink, best way is to select to Auto create a table.
For which you have to make appropriate changes in sink dataset. Now, to configure sink dataset, for table choose edit and manually enter a name for table which does not already exist in your server i.e a new table will be created in this name in the sql server mentioned in sink. Make sure you clear all schema as you will be getting source schema in copy activity.
Back to mapping tab in copy activity, click on import schema and select the fields you want to copy to table. Additionally you can specify the data types and Collection reference is necessary.
Refer: Parameterize mapping
You can also switch to Advanced editor, in which case you can directly see and edit the fields' JSON paths. If you choose to add new mapping in this view, specify the JSON path.
So when a file is created in the storage a blob created event is triggered and pipeline runs
You can see the new table "dbo.NewTable" created under ktestsql and it has the data from xml as row.

Copy Blob Data To Sql Database in Azure Data Factory with Conditions

I am performing a a trigger based pipeline to copy data from blob storage to SQL database. In every blob file there are bunch of JSONs from which I need to copy just few of them and I can differenciate them on the basis of a Key-value pair present in every JSON.
So How to filter those JSON containing that Value corresponding to a common key?
One Blob file looks like this. Now While the copy activity is happening ,it should filter data according to the Event- Name: "...".

Data factory in general only moves data, it doesnt modify it. What you are trying to do might be done using a staging table in the sink sql.
You should first load the json values as-is from the blob storage in the staging table, then copy it from the staging table to the real table where you need it, applying your logic to filter in the sql command used to extract it.
Remember that sql databases have built in functions to treat json values: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
Hope this helped!

At this time we do not have an option for the copy activity to filter the content( with an exception of sql source ) .
In your scenario it looks like that already know which values needs to omitted , on way to go will be have a "Stored Procedure" activity , after the copy activity which will be just delete the values which you don't want from the table ,this should be easy to implement but depending on the volume of data it may lead to performance issues . The other option is to have the JSON file cleaned on the storage side before it is ingested .

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string