I was trying to pull/load data from an on-prem data lake into Azure Data Lake using Azure Data Factory.
I was just giving a query to pull all the columns. My sink is Azure Data Lake Gen2.
But my column names are coming out wrong in the source and sink.
My column names in the on-prem data lake are like user_id, lst_nm, etc., but in Azure they come out as user_tbl.user_id, user_tbl.lst_nm, etc. Here user_tbl is my table name.
I don't want the table name added to the column names.
Azure won't add the table name to the columns by itself. Check the output of the SELECT query you are sending to the source using Preview data in ADF; that will show you the actual column names ADF is getting from the source. If they don't have the table name prefixed, then check whether your ADLS Gen2 destination folder already has a file in it. If it does, remove the file and try running the pipeline again.
Instead of using the Copy activity, use a Data Flow transformation, which allows you to change the column names at the destination dynamically.
You can also use the Move and transform activities, which likewise allow you to change column names. Refer to the official tutorial: Dynamically set column names in data flows
Also check ADF Mapping Data Flows: Create rules to modify column names
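If you go the Data Flow route, one minimal sketch (assuming the prefix is literally "user_tbl." as in your example) is a Select transformation with a rule-based mapping that matches every column and strips the prefix from its name:
Matching condition: true()
Name as: replace($$, 'user_tbl.', '')
Here $$ refers to the name of each matched column; adjust the prefix string to whatever your source actually returns.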
Here is my requirement:
I have an Excel file with a few columns and a few rows of data.
I have uploaded this Excel file to Azure Blob storage.
Using ADF, I need to read this Excel file, parse its records one by one, and create dynamic folders in Azure Blob storage.
This needs to be done for each and every record present in the Excel file.
Each record in the Excel file has some information that is going to help me create the folders dynamically.
Could someone help me in choosing the right set of activities or data flow in ADF to do this work?
Thanks in advance!
This is my Excel file as a Source.
I have created folders in Blob storage based on the Country column.
I have used a Data Flow activity.
As shown in the screenshot below, go to the Optimize tab of the sink configuration.
Select the Partition option "Set Partitioning".
Set the Partition type to "Key".
Set Unique value per partition to the Country column.
Now run the pipeline.
Expected output:
Inside these folders you will get files with the corresponding data.
I am trying to add an additional column to the Azure Data Explorer sink from the source Blob storage using "Additional columns" in the Copy activity "Source" tab, but I am getting the following error:
"Additional columns are not supported for your sink dataset, please create a new dataset to enable additional columns."
When I change the sink dataset to Blob storage, it works fine and the additional column gets created. Is there anything I am missing here when using Azure Data Explorer as the sink?
Alternatively, how can I add an additional column to the Azure Data Explorer table as a sink?
Additional Columns
As per the official doc: "This feature works with the latest dataset model. If you don't see this option from the UI, try creating a new dataset."
The ADX sink doesn't support altering the table from the Copy activity.
To add a column to the ADX table, run the .alter-merge table command in advance, then map the additional column to the target column under the Mapping tab of the Copy activity.
.alter-merge table command
So basically my issue is this: I will use Get Metadata to get the names of the files from a source folder in the storage account in Azure. I need to parse each name and insert the data into the respective table. Example below.
File names are in this format: customer_GUIID_TypeOfData_other information.csv
e.g. 1c56d6s4s33s4_Sales_09112021.csv
156468a5s5s54_Inventory_08022021.csv
So these are two different customers and two different types of information.
The tables in SQL will be named exactly that, without the date: 156468a5s5s54_Inventory or 1c56d6s4s33s4_Sales.
How can I copy the data from the CSV to the respective table depending on the file name? I will also need to insert or update existing rows in the destination table, based on a unique identifier in the file dataset, using Azure Data Factory.
Get the file names using the Get Metadata activity and copy the data from the CSVs to the Azure SQL tables using a Data Flow activity with upsert enabled.
Input blob files:
Step1:
• Create a DelimitedText (CSV) source dataset. Create a parameter for the file name so it can be passed dynamically.
• Create an Azure SQL Database sink dataset and create a parameter to pass the table name dynamically.
Source dataset:
Sink dataset:
Step2:
• Connect the source dataset to a Get Metadata activity and pass “*.csv” as the file name to get a list of all file names in the blob folder.
Output of Get Metadata:
Step3:
• Connect the output of the Get Metadata activity to a ForEach loop to load all the incoming source files/data to the sink.
• Add an expression to the Items property to get the child items from the previous activity.
@activity('Get Metadata1').output.childItems
Step4:
• Add a Data Flow activity inside the ForEach loop.
• Connect the data flow source to the source dataset.
Dataflow Source:
Step5:
• Connect the data flow sink to the sink dataset.
• Enable Allow upsert to update records, if they already exist, based on the unique key column.
Step6:
• Add an Alter Row transformation between source and sink to add the condition for the upsert.
• Upsert when the unique key column is not null, i.e. the record is found.
Upsert if: isNull(id)==false()
Step7:
• In the ForEach loop, in the Data Flow activity settings, add expressions to set the source file name and sink table name dynamically.
Src_file_name: @item().name
• As we are extracting the sink table name from the source file name, split the file name on the underscore “_” and then concatenate the first two parts to eliminate the date part, as in the expression below.
Sink_tbname: @concat(split(item().name, '_')[0], '_', split(item().name, '_')[1])
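For the sample files above, these expressions would evaluate roughly as follows (illustrative only):
@item().name → 1c56d6s4s33s4_Sales_09112021.csv
@concat(split(item().name, '_')[0], '_', split(item().name, '_')[1]) → 1c56d6s4s33s4_Sales
which matches the target table name without the date part.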
Step8:
When the pipeline is run, you can see the loop execute once per source file in the blob folder and load the data into the respective tables based on the file name.
I am trying to copy from Azure Table Storage to the Azure Cosmos DB SQL API using Azure Data Factory V2.
During the copy I want to add a new field (column) to each document by concatenating two of the column values from the table. E.g. my table has two columns, imageId and tenantId, and I want to make the id of the document in Cosmos DB look like image_tenant_ImageID_TenantID.
For this I am trying to add dynamic content for "Additional columns" under "Source" in ADF but couldn't figure out how to do that. Can anyone please help with this?
It seems I cannot do it as part of the Copy activity. So first I copy from the table to a temporary Cosmos DB container, and then use a data flow to create the new column with a "Derived Column" transformation and copy the results to a new container.
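For reference, the Derived Column expression in that intermediate data flow can be a simple concatenation along these lines (column names are taken from the question; cast with toString() first if either column is not already a string):
concat('image_tenant_', imageId, '_', tenantId)
Map the derived column to id in the Cosmos DB sink so it becomes the document id.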
Another way to achieve this is to use the Change Feed to get your documents, add this property, and then replace the documents.
I am trying to ingest a few CSV files from Azure Data Lake into Azure Synapse using Polybase.
There is a fixed set of columns in each CSV file and the column names are given on the first line. However, the columns can come in a different order.
With PolyBase, I need to declare an external table, for which I need to know the exact column order at design time, so I cannot create the external table. Are there other ways to ingest the CSV files?
I don't believe you can do this directly with PolyBase because, as you noted, the CREATE EXTERNAL TABLE statement requires the column declarations. At runtime, the CSV data is then mapped to those column names.
You could accomplish this easily with Azure Data Factory and Data Flow (which uses PolyBase under the covers to move the data to Synapse) by allowing the Data Flow to generate the table. This works because the table is generated after the data has been read, rather than before as with an EXTERNAL table.
For the sink Data Set, create it with parameterized table name [and optionally schema]:
In the Sink activity, specify "Recreate table":
Pass the desired table name to the sink Data Set from the Pipeline:
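For example, if the sink dataset has a parameter named TableName and the pipeline has a parameter named SinkTableName (both names are just illustrative), set the dataset's table property to its parameter and pass the value in from the pipeline with an expression such as:
@pipeline().parameters.SinkTableName
or simply a fixed table name, depending on how the pipeline is triggered.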
Be aware that all string-based columns will be defined as VARCHAR(MAX).