I’m using a Copy activity in an Azure Synapse pipeline to copy and filter data from
containerA/file1.csv to containerB/file2US.csv
Similarly, I’m using another Copy activity to copy and filter data from containerA/file1.csv to containerB/file2IND.csv
The same process repeats for the other regions. In every activity I add a WHERE clause to filter the data and copy it into a region-specific file.
It feels pretty redundant to do it this way. Is there any way I can conditionally check each row and copy it to a different sink based on the region value?
What I’m trying to achieve is a SINGLE ACTIVITY that can select the correct sink based on a condition each row maps to.
The activity you are looking for is called Data Flows. You will use the Conditional Split transformation with as many sinks as you require to achieve this use case.
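As a rough sketch, assuming the region lives in a column called Region and using placeholder dataset names, the data flow definition could look something like this (the split conditions and sinks are whatever your regions require):

{
  "name": "SplitByRegion",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        { "name": "source1", "dataset": { "referenceName": "File1Csv", "type": "DatasetReference" } }
      ],
      "sinks": [
        { "name": "sinkUS", "dataset": { "referenceName": "File2US", "type": "DatasetReference" } },
        { "name": "sinkIND", "dataset": { "referenceName": "File2IND", "type": "DatasetReference" } }
      ],
      "scriptLines": [
        "source(output(Region as string), allowSchemaDrift: true, validateSchema: false) ~> source1",
        "source1 split(Region == 'US',",
        "     Region == 'IND',",
        "     disjoint: false) ~> SplitByRegion@(US, IND, OtherRegions)",
        "SplitByRegion@US sink(allowSchemaDrift: true, validateSchema: false) ~> sinkUS",
        "SplitByRegion@IND sink(allowSchemaDrift: true, validateSchema: false) ~> sinkIND"
      ]
    }
  }
}

Each split condition routes matching rows to its own output stream, and each stream writes to its own sink, so a single Data Flow activity replaces all of the per-region Copy activities.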
I would approach this using a For Each activity which runs in parallel and a parameterised Copy activity. You can use an array parameter to list the regions you want to loop through. Here's an example with continents:
["Africa","Antarctica","Asia","Australia","Europe","North America","South America"]
Set up your pipeline with the ForEach wrapping a single parameterised Copy activity (a rough JSON sketch is below).
In the Copy activity's Source, use the Query option and parameterise it with the Add dynamic content button. Alternately use a Stored Proc.
Parameterise the Sink using a dataset parameter. This will give you control of the output filename and location.
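As a rough illustration only, the pipeline JSON could look something like the sketch below. It assumes a SQL source table dbo.Sales with a Region column, a source dataset SalesTable, and a sink dataset File2Csv with a fileName parameter; all of these names are placeholders for your own.

{
  "name": "CopyPerRegion",
  "properties": {
    "parameters": {
      "regions": {
        "type": "Array",
        "defaultValue": ["Africa", "Antarctica", "Asia", "Australia", "Europe", "North America", "South America"]
      }
    },
    "activities": [
      {
        "name": "ForEachRegion",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.regions", "type": "Expression" },
          "isSequential": false,
          "activities": [
            {
              "name": "CopyOneRegion",
              "type": "Copy",
              "inputs": [ { "referenceName": "SalesTable", "type": "DatasetReference" } ],
              "outputs": [
                {
                  "referenceName": "File2Csv",
                  "type": "DatasetReference",
                  "parameters": {
                    "fileName": { "value": "@concat('file2', item(), '.csv')", "type": "Expression" }
                  }
                }
              ],
              "typeProperties": {
                "source": {
                  "type": "AzureSqlSource",
                  "sqlReaderQuery": {
                    "value": "@concat('SELECT * FROM dbo.Sales WHERE Region = ''', item(), '''')",
                    "type": "Expression"
                  }
                },
                "sink": { "type": "DelimitedTextSink" }
              }
            }
          ]
        }
      }
    ]
  }
}

The @concat(...) expressions are the pieces you would build with the Add dynamic content button: one builds the WHERE clause from item(), the other builds the region-specific output filename.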
I am setting up a pipeline in Data Factory where the first part of the pipeline needs some pre-processing/cleaning. I currently have a script set up to query the rows that need to be deleted and export the results to a CSV.
What I am looking for is essentially the opposite of an upsert copy activity. I would like the procedure to delete the rows in my table based on a matching row.
Apologies in advance if this is an easy solution, I am fairly new to Data Factory and just need help looking in the right direction.
Assuming the source you are initially getting the rows from is different from the sink, there are multiple ways to achieve this:
If the number of rows is small, you can leverage a Script activity or a Lookup activity to delete the records from the destination table.
For a larger dataset (given the limitations of the Lookup activity), you can copy the data into a staging table in the destination and leverage a Script activity to delete the matching rows.
If your org supports the usage of data flows, you can use one to achieve it.
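For the staging-table option, a minimal sketch of the Script activity that would run after the Copy into staging might look like this. The table names, the join key Id, and the upstream activity name CopyRowsToStaging are all placeholders; adjust the join to whatever identifies a matching row.

{
  "name": "DeleteMatchingRows",
  "type": "Script",
  "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
  "typeProperties": {
    "scripts": [
      {
        "type": "NonQuery",
        "text": "DELETE t FROM dbo.TargetTable AS t INNER JOIN dbo.StagingDeletes AS s ON t.Id = s.Id;"
      }
    ]
  },
  "dependsOn": [
    { "activity": "CopyRowsToStaging", "dependencyConditions": [ "Succeeded" ] }
  ]
}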
I'm facing a pretty interesting task to convert an arbitrary CSV file to a JSON structure following this schema:
{
  "Data": [
    ["value_1", "value_2"],
    ["value_3", "value_4"]
  ]
}
In this case, the input file will look like this:
value_1,value_2
value_3,value_4
The requirement is to use Azure Data Factory and I won't be able to delegate this task to Azure Functions or other services.
I'm thinking about using the 'Copy data' activity but can't get my mind around the configuration. TabularTranslator seems to only work with a fixed number of columns, but the CSV I receive can contain any number of columns.
Maybe Data Flows can help me, but their setup doesn't look easy either. Plus, if I understand correctly, Data Flows take more time to start up.
So, basically, I just need to take the CSV content and put it into "Data" 2d array.
Any ideas on how to accomplish this?
To achieve this requirement, using Copy data or TabularTranslator is complicated. It can be achieved using a data flow in the following way.
First, create a source dataset configured so that the entire row is read as a single string column (for a delimited text dataset, leave the column delimiter empty).
Import the projection and name the column data. This is what the data preview looks like:
Now split the column values using the split function in a Derived Column transformation. I am replacing the same column using split(data, ',').
Then I have added a key column with a constant value 'x' so that I can group all rows and convert the grouped data into an array of arrays.
The data would look like this after the above step:
Use an Aggregate transformation to group by the key column created above, and use the collect aggregate function to create the array of arrays (collect(data)), naming the output column Data.
Use a Select transformation to select only the aggregated column Data.
Finally, in the sink, select your destination and create a sink JSON dataset. Choose 'Output to single file' in the settings and give a file name.
Create a Data Flow activity in a pipeline and run the above data flow. The file will be created, and it looks like the following:
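Putting the steps together, the data flow would look roughly like this in JSON; the dataset names are placeholders, and the scriptLines simply mirror the transformations described above.

{
  "name": "CsvToJsonArray",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        { "name": "source1", "dataset": { "referenceName": "SingleColumnCsv", "type": "DatasetReference" } }
      ],
      "sinks": [
        { "name": "sink1", "dataset": { "referenceName": "OutputJson", "type": "DatasetReference" } }
      ],
      "scriptLines": [
        "source(output(data as string), allowSchemaDrift: true, validateSchema: false) ~> source1",
        "source1 derive(data = split(data, ',')) ~> SplitToArray",
        "SplitToArray derive(key = 'x') ~> AddKey",
        "AddKey aggregate(groupBy(key), Data = collect(data)) ~> CollectRows",
        "CollectRows select(mapColumn(Data)) ~> KeepData",
        "KeepData sink(allowSchemaDrift: true, validateSchema: false) ~> sink1"
      ]
    }
  }
}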
I am new to Azure Data Factory, and I currently have the following setup for a pipeline.
(Screenshots: the Azure Data Factory pipeline, and the activities inside the ForEach.)
The pipeline does the following:
Reads files from a directory every day
Filters the children in the directory based on file type [only selects TSV files]
Iterates over each file and copies the data to Azure Data Explorer if they have the correct schema, which I have defined in mapping for the copy activity.
The copied files are then moved to a different directory and deleted from the original directory so that they aren't copied again.
[Question]: I want to delete or skip the rows which have null value in any one of the attributes.
I was looking into using data flow, but I am not sure how to use data flows to read multiple tsv files and validate their schema before applying transformations to delete the null records.
Please let me know if there is a solution where I can skip the null values in the for each loop or if I can use data flow to do the same.
If I can use data flow, how do I read multiple files and validate their column names (schema) before applying row transformations?
Any suggestions that would help me delete or skip those null values will be hugely helpful
Thanks!
Ok, inside the ForEach activity, you only need to add a Data Flow activity.
The main idea is to do the filter/assert step and then write to multiple sinks.
ADF data flow:
Source:
Add your TSV file as the source and, in the source options, set After completion -> Delete source files; this will save you from adding a separate Delete activity.
Filter:
Now, it depends on your use case: do you want to filter out rows with null values, or do you want to validate that you don't have null values?
If you want to filter, just add a Filter transformation and, in the filter settings, set Filter on to your condition.
If you need to validate rows and make the data flow fail, use the Assert transformation.
Filter condition: !isNull(columnName)
Sink:
I added two sinks: one for Azure Data Explorer and one for the new directory.
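A minimal sketch of such a data flow, assuming two hypothetical columns col1 and col2 that must not be null (swap in your own column names and dataset references):

{
  "name": "FilterNullRows",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        { "name": "source1", "dataset": { "referenceName": "TsvFiles", "type": "DatasetReference" } }
      ],
      "sinks": [
        { "name": "sinkADX", "dataset": { "referenceName": "AdxTable", "type": "DatasetReference" } },
        { "name": "sinkArchive", "dataset": { "referenceName": "ArchiveDirectory", "type": "DatasetReference" } }
      ],
      "scriptLines": [
        "source(allowSchemaDrift: true, validateSchema: false) ~> source1",
        "source1 filter(!isNull(col1) && !isNull(col2)) ~> FilterNulls",
        "FilterNulls sink(allowSchemaDrift: true, validateSchema: false) ~> sinkADX",
        "FilterNulls sink(allowSchemaDrift: true, validateSchema: false) ~> sinkArchive"
      ]
    }
  }
}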
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-assert
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-filter
https://microsoft-bitools.blogspot.com/2019/05/azure-incremental-load-using-adf-data.html
Please consider the incremental load (last link) and change the data flow accordingly.
I used this article to successfully copy data from one table to another using data flows in Data Factory. Now my scenario is to handle multiple tables in the DB; the above example covers just one of the tables.
I tried to follow the next article (link) in the same series and have created the view and the ForEach loop, but I am now wondering how I should pass the input to the Data Flow activity.
Any ideas, or has anyone tried the same thing?
Thanks
You will need to use a parameterized dataset that uses a dataset parameter for the name of the table. Then, pass a string parameter from the Foreach activity that contains the table name into the dataset parameter for that data flow activity. This will all be accomplished from the pipeline.
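As a sketch, the parameterised dataset could look like this (the linked service and names are placeholders); inside the ForEach, the Data Flow activity then passes something like @item() or @item().TableName, depending on the shape of your lookup output, into the tableName dataset parameter.

{
  "name": "SqlTableByName",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
    "parameters": {
      "tableName": { "type": "String" }
    },
    "typeProperties": {
      "schema": "dbo",
      "table": { "value": "@dataset().tableName", "type": "Expression" }
    }
  }
}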
I am running a trigger-based pipeline to copy data from Blob Storage to a SQL database. In every blob file there are a bunch of JSON documents, of which I need to copy just a few, and I can differentiate them on the basis of a key-value pair present in every JSON.
So how do I filter the JSON documents containing a particular value for that common key?
One blob file looks like this. While the copy activity is happening, it should filter data according to the Event-Name: "...".
Data Factory in general only moves data; it doesn't modify it. What you are trying to do might be done using a staging table in the sink SQL database.
You should first load the JSON values as-is from Blob Storage into the staging table, then copy them from the staging table to the real table where you need them, applying your filter logic in the SQL command used to extract them.
Remember that SQL databases have built-in functions to work with JSON values: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
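For instance, a hedged sketch of the second step as a Script activity, assuming the staging table stores each document in a RawJson column and that the key is reachable at a path like $.EventName (the table, column, path, and value are all placeholders for your actual schema):

{
  "name": "CopyFilteredEvents",
  "type": "Script",
  "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
  "typeProperties": {
    "scripts": [
      {
        "type": "NonQuery",
        "text": "INSERT INTO dbo.Events (EventName, Payload) SELECT JSON_VALUE(RawJson, '$.EventName'), RawJson FROM dbo.StagingEvents WHERE JSON_VALUE(RawJson, '$.EventName') = 'DesiredEvent';"
      }
    ]
  }
}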
Hope this helped!
At this time we do not have an option for the Copy activity to filter the content (with the exception of a SQL source).
In your scenario it looks like you already know which values need to be omitted. One way to go would be to have a Stored Procedure activity after the Copy activity which simply deletes the values you don't want from the table. This should be easy to implement, but depending on the volume of data it may lead to performance issues. The other option is to have the JSON file cleaned on the storage side before it is ingested.
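If you go the stored procedure route, the activity that follows the Copy could look roughly like this; the procedure name, its parameter, and the upstream activity name are placeholders, and the procedure itself would contain the DELETE logic for the values you want to drop.

{
  "name": "DeleteUnwantedRows",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
  "typeProperties": {
    "storedProcedureName": "dbo.usp_DeleteUnwantedEvents",
    "storedProcedureParameters": {
      "EventName": { "value": "UnwantedEvent", "type": "String" }
    }
  },
  "dependsOn": [
    { "activity": "CopyBlobToSql", "dependencyConditions": [ "Succeeded" ] }
  ]
}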