I want to copy data from a CSV file (Source) on Blob storage to an Azure SQL Database table (Sink) via a regular Copy activity, but I also want to copy the file name alongside every entry into the table. I am new to ADF, so the solution is probably easy, but I have not been able to find the answer in the documentation or on the internet so far.
My mapping currently looks like this (I have created an output table with a file name column, but this data is not defined as a column in the CSV file, so I need to extract it from the metadata and pair it with the column):
At first, I thought I would put dynamic content in there and solve the problem that way, but there is no option to use dynamic content in each individual box, so I do not know how to implement it. My next thought was to use a pre-copy script, but I have not seen how I could use it for this purpose. What is the best way to solve this issue?
In the mapping columns of the Copy activity, you cannot add dynamic content from the metadata.
First give the source CSV dataset to the Get Metadata activity, then connect it to the Copy activity as shown below.
You can add the file name column through Additional columns in the Copy activity source itself, by supplying dynamic content from the Get Metadata activity output after giving the Copy activity the same source CSV dataset.
@activity('Get Metadata1').output.itemName
If you are sure about the data types of your data, there is no need to go to the mapping; you can execute your pipeline directly.
Here I am copying the contents of the samplecsv.csv file to a SQL table named output.
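For reference, the sink table could look something like this sketch (the data columns are hypothetical placeholders for whatever your CSV contains; only the extra FileName column is needed for this approach):

CREATE TABLE dbo.[output]
(
    Id       INT,            -- example CSV column (placeholder)
    Name     VARCHAR(100),   -- example CSV column (placeholder)
    FileName NVARCHAR(260)   -- receives @activity('Get Metadata1').output.itemName
);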
My output for your reference:
I am using Azure Data Factory to transfer data from a SOAP API connection to Snowflake. I understand that Snowflake has to receive the data in a variant column or as CSV, or that we need intermediate storage in Azure to finally land the data in Snowflake. The problem I faced is that the data from the API is a string, and within that string there is XML data. So when I put the data in Blob storage, it is a string. How do I avoid this and get the proper columns while writing the data?
Here, the column is read as a string. Is there a way to parse it into its respective rows? I tried to set the collection reference, but it still does not recognize individual columns. Any input is highly appreciated.
You need to switch to the Advanced editor in the Mapping section of the Copy activity. I took the sample data and reproduced this. Below are the steps.
Img:1 Source dataset preview
In the Mapping section of the Copy activity:
Click Import schemas.
Switch to the Advanced editor.
Give the collection reference value.
Img:2 Mapping settings
This is my first question ever so thanks in advance for answering me.
I want to create an external table with Spark in Azure Databricks. I already have the data in my ADLS; it is automatically extracted from different sources every day. The folder structure on the storage is like ../filename/date/file.parquet.
I do not want to duplicate the files by saving a copy of them in another folder/container.
My problem is that I want to add a date column, extracted from the folder path, to the table without copying or changing the source files.
I am using Spark SQL to create the table.
CREATE TABLE IF NOT EXISTS my_ext_tbl
USING parquet
OPTIONS (path "/mnt/some-dir/source_files/")
Is there any proper way to add such a column in one easy and readable step, or do I have to read the raw data into a DataFrame, add the column, and then save it as an external table to a different location?
I am aware that unmanaged tables store only metadata in DBFS. However, I am wondering whether this is even possible.
Hope it's clear.
EDIT:
Since it seems there is no viable solution for this without copying or interfering with the source files, I would like to ask: how are you handling such challenges?
EDIT2:
I think that link might provide a solution. The difference in my case is that the date inside the folder path is not a real partition; it is just a date added by the pipeline that extracts data from the external source.
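To make the edit above concrete, this is roughly what I am considering on top of the external table (a sketch only; it assumes input_file_name() is available on my runtime and that the date segment in the path looks like yyyy-MM-dd):

-- Expose the date from the folder path without copying or changing the source files.
CREATE OR REPLACE VIEW my_ext_tbl_with_date AS
SELECT
  *,
  to_date(regexp_extract(input_file_name(), '/(\\d{4}-\\d{2}-\\d{2})/', 1)) AS source_date
FROM my_ext_tbl;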
I need to copy the file names of Excel files that are stored as blobs in my Azure Storage and then put these names into a SQL Server table using ADF. The file path can serve as the file name, but the hardest thing is that in the dataset which takes all the files from one specific folder, I have to select a sheet name, and these sheet names are different for each file, so it returns an error. Is there a way to create a collective dataset without indicating the sheet name?
So, if I understand your question correctly, you are looking for a way to write all Excel file names to a SQL database using ADF.
You can use the generic Get Metadata activity with a binary dataset as the source. Select Child items as a field to retrieve; this will retrieve all files in the folder. Then add a filter to select only the Excel file types.
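On the SQL side, the sink for those names can be a very simple table; the following is just a hypothetical sketch (table and column names are examples):

CREATE TABLE dbo.ExcelFileNames
(
    FileName nvarchar(400) NOT NULL,  -- blob name or full path taken from Child items
    LoadedOn datetime2(3)  NOT NULL DEFAULT (sysutcdatetime())
);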
Hope that this gets you on the right track.
I am creating a pipeline using ADF to copy the data in an XML file to a SQL database. I want this pipeline to be triggered when the XML file is uploaded to Blob Storage. Therefore, I will be using a parameter with the input dataset.
Now, in the Copy Data activity that I am using, I want to be able to define the mappings. This is usually quite easy when the path to the file is given; however, in this situation, where a parameter is being used, how can I do this?
From what I have gathered, the mappings can be defined as a JSON schema and assigned to the activity, but is there perhaps an easier way to do this? Maybe by uploading a demo file from which the schema can be imported?
When you want to load an XML file into a SQL database, you are using a hierarchical source to tabular sink method.
When copying data from hierarchical source to tabular sink, copy activity supports the following capabilities:
Extract data from objects and arrays.
Cross apply multiple objects with the same pattern from an array, in which case one object is converted into multiple records in the tabular result.
You can define such mapping on Data Factory authoring UI:
On the Copy activity -> Mapping tab, click the Import schemas button to import both source and sink schemas. As Data Factory samples the top few objects when importing the schema, if any field doesn't show up, you can add it to the correct layer in the hierarchy: hover over an existing field name and choose to add a node, an object, or an array.
Select the array from which you want to iterate and extract data. It will be auto populated as Collection reference. Note only single array is supported for such operation.
Map the needed fields to sink. Data Factory automatically determines the corresponding JSON paths for the hierarchical side.
Note: For records where the array marked as collection reference is empty and the check box is selected, the entire record is skipped.
Here I am using a sample XML file at the source.
If you notice, I have used a dataset parameter to which I will assign the file name value obtained from the trigger. I have placed it in the file name field of the file path property in the dataset connection.
Next, I have created a pipeline parameter to hold the input obtained from the trigger before assigning it to the dataset parameter.
Create storage event trigger
Click Continue and you will find a preview of all the files that match the trigger conditions.
When you move to the next page, if you have created pipeline parameters, which we have, you will see them there.
Fill in the values as per your need. See the available system variables here: Storage event trigger scope.
Now, let's move to the Copy data activity. Here you will find the dataset parameter; assign the pipeline parameter value to it.
Now move to the Sink tab in the Copy activity. Since you want the source schema to be carried into the sink, the best way is to select Auto create table.
For this, you have to make the appropriate changes in the sink dataset. To configure the sink dataset, choose Edit for the table and manually enter a name for a table that does not already exist on your server, i.e. a new table with this name will be created in the SQL server specified in the sink. Make sure you clear all schema, as you will be getting the source schema from the Copy activity.
Back on the Mapping tab in the Copy activity, click Import schemas and select the fields you want to copy to the table. Additionally, you can specify the data types; the Collection reference is necessary.
Refer: Parameterize mapping
You can also switch to Advanced editor, in which case you can directly see and edit the fields' JSON paths. If you choose to add new mapping in this view, specify the JSON path.
So when a file is created in the storage, a blob created event is triggered and the pipeline runs.
You can see the new table "dbo.NewTable" created under ktestsql, and it has the data from the XML as rows.
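As a quick sanity check on the SQL side (using the table name from this example), you can query the auto-created table directly:

-- Confirm that the XML records landed as rows
SELECT TOP (10) *
FROM dbo.NewTable;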
I am trying to copy different CSV files in Blob storage into their very own SQL tables (I want to auto-create these tables). I've seen a lot of questions, but I haven't seen any that answer this.
Currently I have a Get Metadata activity that grabs a list of child items to get the names of the files, and a ForEach loop, but from there I don't know how to send them to a different table per file.
Updated:
When I run it a second time, it will add new rows to the table.
I created a simple test and it works well. This is my csv file stored in Azure Data Lake.
Then we can use a pipeline to copy these CSV files into Azure SQL tables (auto-creating the tables).
1. At the GetMetaData1 activity, we can set the dataset to the folder containing the CSV files and select First row as header in the dataset.
2. At the ForEach1 activity, we can iterate over the file list via the expression @activity('Get Metadata1').output.childItems.
3. Inside the ForEach1 activity, we can use the Copy data1 activity with the same data source as the GetMetaData1 activity. At the Source tab, we can enter the dynamic content @item().name to get the file name.
At the Sink tab, we should select Auto create table.
In the Azure SQL dataset, we should enter the schema name and the dynamic content @replace(item().name,'.csv','') as the table name, because this information is needed to create the table dynamically.
The debug result is as follows:
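If you want to confirm the result from the SQL side as well, a query along these lines (hypothetical; run it against the sink database) lists the auto-created tables, one per CSV file:

-- List user tables in the sink database; each csv file should have produced one table
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME;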