How to set parquet column names dynamically - Azure

I want to merge multiple CSV files in an Azure Synapse pipeline.
I plan to do this with a Copy activity, but I am facing a problem.
There are two types of source files: one type has a header (file type 'with header') and the other does not (file type 'without header').
I want to set the schema using the header of the 'with header' file, but I don't know how to do this.
In my opinion, it could be achieved in the following way; is it possible?
1. Get the list of column names from the 'with header' file using a Lookup activity.
2. Set the list of column names to a variable of type array.
3. Use the variable for mapping in the Copy activity and merge the multiple files.
Can I use a list of column names for mapping?
Any answers would be appreciated.
Thank you.

Can I use a list of column names for mapping?
No, you cannot use a list of columns in dynamic mapping; you need to specify the mapping in JSON form, like below:
{
    "source": {
        "name": "Id",
        "type": "String",
        "physicalType": "String"
    },
    "sink": {
        "name": "Id",
        "type": "String",
        "physicalType": "String"
    }
}
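For reference, when the mapping is passed as dynamic content (for example from a pipeline parameter), this snippet sits inside a translator object with a mappings array. A rough sketch of the full shape, where the second column (Name) is purely illustrative:

{
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "Id", "type": "String", "physicalType": "String" },
            "sink": { "name": "Id", "type": "String", "physicalType": "String" }
        },
        {
            "source": { "name": "Name", "type": "String", "physicalType": "String" },
            "sink": { "name": "Name", "type": "String", "physicalType": "String" }
        }
    ]
}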
I want to set the schema using the header of the 'with header' file. But I don't know how to do this.
In your scenario, you first need to segregate the files into 'with header' and 'without header'.
Then get the list of 'without header' files using the Get Metadata activity.
Then add the header and schema to every file using a ForEach activity and a dataflow: pass the output of Get Metadata to the ForEach activity.
Inside it, take a Dataflow activity that adds the header to each file and creates a new file with header columns, e.g.:
Set the 'without header' file as source 1 and do not select 'First row as header'.
Set the header file as source 2 and do not select 'First row as header'.
In the SurrogateKey1 activity, enter row as the key column and 2 as the start value.
In the SurrogateKey2 activity, enter row as the key column and 1 as the start value.
Then union the SurrogateKey1 and SurrogateKey2 streams in the Union1 activity.
Then sort these rows by the row column in the Sort1 activity.
In the sink mapping, remove the row column and save the output to the 'files with header' folder.
After a successful execution, you can merge the files that now have headers.
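To make the ordering trick concrete, here is a rough illustration with made-up values (col_1 and col_2 stand in for whatever auto-generated column names ADF assigns when 'First row as header' is not selected):

row | col_1    | col_2
1   | Customer | Gender    <- from the header file (SurrogateKey2, start value 1)
2   | 1        | Male      <- from the data file (SurrogateKey1, start value 2)
3   | 4        | Female    <- from the data file

After sorting by row and dropping the row column in the sink mapping, the header row lands on the first line of the output file, above the data rows.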

Related

Skip null rows while reading Azure Data Factory

I am new to Azure Data Factory, and I currently have the following setup for a pipeline.
Azure Data Factory Pipeline
Inside the for each
The pipeline does the following:
Reads files from a directory every day
Filters the children in the directory based on file type [only selects TSV files]
Iterates over each file and copies the data to Azure Data Explorer if they have the correct schema, which I have defined in mapping for the copy activity.
The copied files are then moved to a different directory and deleted from the original directory so that they aren't copied again.
[Question]: I want to delete or skip the rows which have a null value in any one of the attributes.
I was looking into using data flow, but I am not sure how to use data flows to read multiple tsv files and validate their schema before applying transformations to delete the null records.
Please let me know if there is a solution where I can skip the null values in the for each loop or if I can use data flow to do the same.
If I can use data flow, how do I read multiple files and validate their column names (schema) before applying row transformations?
Any suggestions that would help me delete or skip those null values would be hugely helpful.
Thanks!
Ok, inside the ForEach activity, you only need to add a dataflow activity.
The main idea is to do the filter/assert step and then write to multiple sinks.
ADF dataflow:
Source:
Add your TSV file as the source, and make sure to select 'After completion -> Delete source files'; this saves you from adding a separate Delete activity.
Filter activity:
Now, it depends on your use case: do you want to filter out rows with null values, or do you want to validate that there are no null values?
If you want to filter, just add a Filter activity; in the filter settings, under 'Filter on', add your condition.
If you need to validate the rows and make the dataflow fail, use the Assert activity.
Filter condition: !isNull(columnName)
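If a row must have no nulls in several specific columns, the condition can be extended; a sketch with placeholder column names:

!isNull(col1) && !isNull(col2) && !isNull(col3)

The same expression can be used as an 'Expect true' condition in the Assert activity if you prefer the dataflow to fail instead of silently dropping rows.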
Sink:
I added two sinks, one for Azure Data Explorer and one for the new directory.
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-assert
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-filter
https://microsoft-bitools.blogspot.com/2019/05/azure-incremental-load-using-adf-data.html
Please consider the incremental load and change the dataflow accordingly.

When using 'Additional columns' in Azure Data Factory's copy activity, will it duplicate the column on sink side if it exists on source already?

While copying a CSV file I need to be sure that a certain column exists on the target. This column may or may not exist in the source file. If I use the additional columns part of the copy activity, will it avoid duplication?
The Additional columns setting in the copy activity adds extra data columns to copy to the sink along with the source data.
It will not validate whether the column already exists in the sink.
An additional column can store the source file path, duplicate an existing source column as another column, or hold a static value, variables, or pipeline parameters.
Refer to the MS documentation for more details on additional columns in the copy activity.
You can use the Get Metadata activity to get the column names from the source and sink datasets and compare them. Based on the If Condition result, you can run the copy with the additional column in the True activities and without it in the False activities.
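As a rough sketch (the activity name Get Source Metadata and the column name Year are assumptions used only for illustration): if the Get Metadata activity returns the source schema via the Structure field, an If Condition expression that checks whether the column already exists could look like:

@contains(string(activity('Get Source Metadata').output.structure), '"name":"Year"')

This serializes the structure array and does a simple string match, so treat it as a quick check rather than a robust schema comparison.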
Refer to this similar SO link.

Regex Additional Column in Copy Activity Azure Data Factory

I am trying to parse the $$FILEPATH value in the "Additional columns" section of the Copy Activity.
The filepaths have a format of time_period=202105/part-12345.parquet. I would like just the "202105" portion of the filepath. I cannot hardcode it because there are other time_period folders.
I've tried this (from the link below): #{substring($$FILEPATH, add(indexOf($$FILEPATH, '='),1),sub(indexOf($$FILEPATH, '/'),6))} but I get an error saying Unrecognized expression: $$FILEPATH
The only other things I can think of are using: 1) Get Metadata Activity + For each Activity or 2) possibly trying to do this in DataFlow
$$FILEPATH is a reserved variable that stores the file path; you cannot build a dynamic expression around $$FILEPATH.
You have to create a variable to store the folder name as required and then pass it dynamically in an additional column.
Below is what I have tried.
As your folder name is not static, get the folder names using the Get Metadata activity.
Get Metadata Output:
Pass the output to the ForEach activity to loop all the folders.
Add a variable at the pipeline level to store the folder name.
In the ForEach activity, add the set variable activity to extract the date part from the folder name and add the value to the variable.
@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))
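For example, for a folder named time_period=202105: indexof returns the zero-based position of '=', which is 11, so the substring starts at position 12; the folder name length is 18, so sub(18, 12) = 6 characters are taken, yielding 202105.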
Output of Set variable:
In the source dataset, parameterize the path/filename so they can be passed dynamically.
Add a Copy data activity after the Set variable activity and select the source dataset.
a) Pass the current item name of the ForEach activity as a file path. Here I hardcoded the filename as *.parquet to copy all files from that path (this works only when all files have the same structure).
b) Under Additional columns, add a new column, give it a name, and under Value, select Add dynamic content and reference the existing variable.
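For example, if the pipeline variable created earlier is named folderDate (the name is an assumption here), the dynamic content for the additional column's value is simply:

@variables('folderDate')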
Add Sink dataset in Copy data Sink. I have added the Azure SQL table as my sink.
In Mapping, add filename (new column) to the mappings.
When you run the pipeline, the ForEach activity runs once for each item returned by the Get Metadata activity.
Output:

Extracting metadata in Azure Data Factory

I have a csv file
Customer,Gender,Age,City
1,Male,23,Chennai
4,Female,34,Madurai
3,Male,23,Bangalore
My Azure SQL DB's table TAB_A has only one column: Column_Name
I need to move the header of the csv file into TAB_A such that the result is:
Column_Name
Customer
Gender
Age
City
Is it possible to achieve this functionality with an ADF Mapping Data Flow, without using Databricks/Python?
I tried Source - Surrogate Key - Filter. I am able to extract the header as a row but unable to transpose it. Any pointers? Thanks.
I created a simple test and successfully inserted the header into the SQL table.
I created a test.csv file, set it as the source data, and unselected 'First row as header'.
Source data preview is as follows:
Use SurrogateKey1 activity to generate a Row_No column.
SurrogateKey1 activity data preview is as follows:
Use a Filter1 activity to keep only the header row via the expression Row_No == 1.
Data preview is as follows:
Use Unpivot1 activity to perform row-column conversion.
Ungroup by Row_No.
Unpivot key: just fill in a column name.
Unpivoted columns: this column name must be consistent with the column name in your SQL table. This way ADF will do automatic mapping.
Data preview is as follows:
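As a rough sketch of that preview (the Key column name and the _c0.._c3 auto-generated source column names are assumptions), the unpivoted output should look roughly like:

Row_No | Key | Column_Name
1      | _c0 | Customer
1      | _c1 | Gender
1      | _c2 | Age
1      | _c3 | City

Since only Column_Name matches a column in TAB_A, the four header values are what end up in the table.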
That's all.

Mixed properties in 'source' column/fields in Azure Data Factory

I'm using Azure Data Factory to copy data from a source folder (Azure Blob) that has multiple folders inside it (each of those folders has a year as its name, and inside the folders are the Excel spreadsheets with the data) to a SQL Server table. I want to iterate through the folders, pick up the folder name, and insert it into a column in the table, so that for each row read from the files in a folder, the folder name is stored in the table, like this:
Data 1 | Data 2 | Year
----------------------
A      | abc    | 2020
B      | def    | 2020
C      | ghi    | 2021
D      | jkl    | 2022
E      | lmn    | 2023
My pipeline is like this:
I have a Get Metadata activity called Get Metadata1 pointing to the folders, followed by a ForEach to iterate through the folders with two activities inside: a Set variable activity that sets a variable named FolderYear to @item().name (to capture the folder name), and a Copy activity that creates an additional column named Year in the dataset using that variable.
I'm trying to map the additional Year column to a column in the table, but when I debug the pipeline, the following error appears:
{ "errorCode": "2200", "message": "Mixed properties are used to reference 'source' columns/fields in copy activity mapping. Please only choose one of the three properties 'name', 'path' and 'ordinal'. The problematic mapping setting is 'name': 'Year', 'path': '','ordinal': ''. ", "failureType": "UserError", "target": "Copy data1", "details": [] }
Is it possible to insert the folder name I'm currently iterating over into a database column?
I made the same test and copied the data (including the folder name) into a SQL table successfully.
I have two folders in the container, and each folder contains one CSV file for testing.
The previous settings are the same as yours.
Inside the ForEach activity, I use Additional columns to add the folder name to the data source.
After copying into the SQL table, the results show as follows:
Update:
My file structure is as follows:
You can use the expression @concat('FolderA/FolderB/', item().name) to build the folder path.
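With the Additional columns approach, the Year column only needs a 'name' reference in the copy activity mapping, which is what avoids the mixed-properties error. A rough sketch of the relevant mapping entry (the types shown are illustrative):

{
    "source": { "name": "Year", "type": "String" },
    "sink": { "name": "Year" }
}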
