I am trying to load data from csv files to Azure SQL DB using a copy activity. First I loaded three files from blob storage to Azure SQL DB. Then three new files were uploaded to blob storage, and now I want to load only the newly added files to Azure SQL DB. The file names are in this format: "student_index_date", where the index runs from 1 to 6 and I have to make use of this index.
You can use the Get Metadata activity to get the list of child items, then filter on the last modified date to pick the latest file and use it as the source in the copy activity.
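A rough sketch of that flow (the activity names and the lastLoadTime pipeline parameter here are placeholders, not anything from the question): pass @activity('Get Metadata1').output.childItems to a ForEach, add a second Get Metadata activity inside it with Last modified in its field list pointed at @item().name, and gate the copy activity with an If Condition such as
@greater(ticks(activity('Get Metadata2').output.lastModified), ticks(pipeline().parameters.lastLoadTime))
so only files added since the previous run are copied. Since the file names carry the index (student_index_date), you could instead compare on the index parsed out with @split(item().name, '_')[1].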
I am trying to transfer on-premises files to Azure blob storage. However, out of the 5 files that I have, 1 has no data, so I can't map the schema. Is there a way I can filter out this file while importing it to Azure? Or would I have to import them into Azure blob storage as is and then filter them into another blob storage? If so, how would I do this?
If your on-premises source is your local file system, first copy the files with their folder structure to a temporary blob container using AzCopy with a SAS key. Please refer to this thread to learn more about it.
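For example, a minimal sketch of the AzCopy command (the local path, storage account, container and SAS token below are placeholders):
azcopy copy "C:\<local-folder>" "https://<storageaccount>.blob.core.windows.net/<temp-container>?<SAS-token>" --recursive
The --recursive flag keeps the folder structure when it lands in the temporary container.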
Then use an ADF pipeline to filter out the empty files and store the rest in the final blob container.
These are my files in the blob container; sample2.csv is an empty file.
First use the Get Metadata activity to get the list of files in that container.
It will list all the files; give that array to the ForEach activity as @activity('Get Metadata1').output.childItems
Inside the ForEach, use a Lookup activity to get the row count of every file, and if the count != 0, use a copy activity to copy the file.
Use a dataset parameter to give the file name.
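For example (the parameter name fileName is just a placeholder): add a fileName parameter to the source dataset, reference it in the dataset's file path as @dataset().fileName, and pass @item().name to it from the Lookup and from the copy activity inside the ForEach.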
Inside the If Condition, use the below expression:
@not(equals(activity('Lookup1').output.count,0))
Inside the True activities, use a copy activity.
Set the copy sink to another blob container:
Execute this pipeline and you can see that the empty file is filtered out.
If your on-premises source is SQL, use a Lookup to get the list of tables and then use a ForEach. Inside the ForEach, follow the same procedure for each individual table.
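For example (the query and activity name are only a sketch): point the Lookup (with First row only unchecked) at a query such as
SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'
and feed @activity('Lookup1').output.value to the ForEach, using @item().TABLE_NAME (and @item().TABLE_SCHEMA) inside the loop.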
If your on-premises source is something other than the ones mentioned above, first copy all the files to blob storage and then follow the same procedure.
I have a scenario where I have to fetch the latest folder from a blob storage container and then process all files under that folder through Azure Data Factory. Currently, all folder names are based on a timestamp, and as we know, CloudBlobDirectory doesn't hold a LastModified date, so there is no way to extract metadata such as last modified time from an Azure Data Factory activity that would let me iterate by timestamp and process the content.
Is there any other way to perform something like a sort on the folder names and then pick the latest one based on a string sort (on the folder name)?
I tried something similar using Azure Functions.
Please have a look and tell me if it's of use:
https://www.youtube.com/watch?v=eUMjghIEsjw
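Another option, if the folder names follow a predictable timestamp format (an assumption on my part), is to build the latest folder name directly in the pipeline instead of sorting, for example a dataset folder path like @{formatDateTime(utcnow(), 'yyyyMMdd')}, though that only works when a folder always exists for the current period.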
I want to copy bulk data from BLOB storage to Azure Synapse with the following structure:
BLOB STORAGE:
devpresented (storage account)
    processedout (container)
        Merge_user (folder)
            part-00000-tid-89051a4e7ca02.csv
        Sales_data (folder)
            part-00000-tid-5579282100a02.csv
SYNAPSE SQL DW:
SCHEMA: PIPELINEDEMO
TABLES: Merge_user, Sales_data
Using Data Factory, I want to copy the BLOB data to the Synapse database as below:
BLOB >> SQLDW
Merge_user >> PIPELINEDEMO.Merge_user
Sales_data >> PIPELINEDEMO.Sales_data
The following doc is mentioned for SQL DB to SQL DW:
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-bulk-copy-portal
However, I didn't find anything for a BLOB source in Data Factory.
Can anyone please suggest how I can move multiple BLOB files to different tables?
If you need to copy the contents of different csv files to different tables in Azure Synapse, you can use multiple copy activities within a single pipeline.
I create a .csv file; this is the content:
111,222,333,
444,555,666,
777,888,999,
I upload this to my storage account, then set this .csv file as the source of the copy activity.
After that, I create a table in Azure Synapse and set it as the sink of the copy activity:
create table testbowman4(Prop_0 int, Prop_1 int, Prop_2 int)
Finally, trigger this pipeline and you will find the data in the table:
You can create multiple similar copy activities, and each copy activity performs a copy from blob to Azure Synapse.
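If you'd rather not build one copy activity per folder by hand, here is a rough sketch of a parameterized alternative (the dataset parameters are placeholders): drive a ForEach with @createArray('Merge_user', 'Sales_data'), and inside it use a single copy activity whose source dataset takes the blob folder as @item() and whose sink dataset takes the table name, so the sink resolves to PIPELINEDEMO.@{item()}. This mirrors the Lookup-plus-ForEach pattern in the bulk copy tutorial linked above.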
I have created a pipeline that extracts data from a data source and stores it as a parquet file in a blob storage (Gen2) account. After checking in Azure Storage Explorer that the file was created, I tried to create an external table from this parquet file in SQL Server Management Studio, and it gives me the following error message:
Msg 15151, Level 16, State 1, Line 1
Cannot find the external data source 'storageGen2FA - File systems', because it does not exist or you do not have permission.
although I checked my permissions and I have the right ones to do this operation.
SQL is Azure SQL and the data source exists.
Thanks in advance
I have a number of files in Azure Data Lake Storage. I am creating a pipeline in ADF v2 to get the list of all the files in a folder in ADLS. How do I do this?
You should use the Get Metadata activity.
Check this:
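For example (a sketch, assuming the default activity name): point the Get Metadata dataset at the ADLS folder, add Child Items to the Field list, and read the result with @activity('Get Metadata1').output.childItems, which returns an array where each entry has the file's name and type.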
You could follow the below steps to list files in ADLS.
1. Use the ADLS SDK to get the list of file names in a specific directory and output the results, such as with the Java SDK here. Of course, you could also use .NET or Python.
import java.util.List;
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.DirectoryEntry;

// 'client' is an authenticated ADLStoreClient; printDirectoryInfo is a helper from the linked sample
// list directory contents (up to 2000 entries under /a/b)
List<DirectoryEntry> list = client.enumerateDirectory("/a/b", 2000);
System.out.println("Directory listing for directory /a/b:");
for (DirectoryEntry entry : list) {
    printDirectoryInfo(entry);
}
System.out.println("Directory contents listed.");
2. Compile the file so that it can be executed, and store it in Azure blob storage.
3. Use a Custom activity in Azure Data Factory to configure the blob storage path and execute the program. For more details, please follow this document.
You could use a Custom activity in Azure Data Factory.
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-java-sdk#list-directory-contents