How to skip already copied files in the Azure Data Factory Copy Data tool?

I want to copy data from Blob Storage (Parquet format) to Cosmos DB. The trigger is scheduled to run every hour, but all the files/data get copied on every run. How can I skip the files that have already been copied?
There is no unique key in the data, and the same file content should not be copied again.

Based on your requirements, you could look into the modifiedDatetimeStart and modifiedDatetimeEnd properties of the Blob Storage dataset.
However, you would need to update the dataset configuration via the SDK on every run so that the time window keeps moving forward.
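If you do go the SDK route, the update amounts to reading the dataset, sliding the window forward, and writing it back. Here is a rough sketch with the .NET management SDK (Microsoft.Azure.Management.DataFactory); the resource group, factory and dataset names are placeholders, and it assumes your SDK version exposes ModifiedDatetimeStart/ModifiedDatetimeEnd on AzureBlobDataset:

using System;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.Rest;

// Rough sketch: slide the modified-datetime window forward by one hour on each run.
// "myRg", "myFactory" and "BlobInputDataset" are placeholder names.
static void SlideWindow(ServiceClientCredentials credentials, string subscriptionId)
{
    var client = new DataFactoryManagementClient(credentials) { SubscriptionId = subscriptionId };

    DatasetResource resource = client.Datasets.Get("myRg", "myFactory", "BlobInputDataset");
    var blobDataset = (AzureBlobDataset)resource.Properties;

    // Assumes these two properties are available in your SDK version.
    blobDataset.ModifiedDatetimeStart = DateTime.UtcNow.AddHours(-1).ToString("o");
    blobDataset.ModifiedDatetimeEnd = DateTime.UtcNow.ToString("o");

    client.Datasets.CreateOrUpdate("myRg", "myFactory", "BlobInputDataset", resource);
}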
Two other solutions you could consider:
1. Use a blob-triggered Azure Function. It fires whenever a blob file is added or modified, and inside it you can transfer the data from Blob Storage to Cosmos DB with SDK code (see the sketch after this list).
2. Use Azure Stream Analytics. You could configure Blob Storage as the input and Cosmos DB as the output.
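For option 1, a minimal sketch of such a function; the container, database and collection names are placeholders, and the Parquet parsing is left as a comment since it depends on the library you choose:

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BlobToCosmos
{
    private static readonly CosmosClient cosmosClient =
        new CosmosClient(Environment.GetEnvironmentVariable("CosmosConnectionString"));

    [FunctionName("BlobToCosmos")]
    public static async Task Run(
        [BlobTrigger("parquet-files/{name}", Connection = "AzureWebJobsStorage")] Stream blob,
        string name,
        ILogger log)
    {
        log.LogInformation($"New or updated blob: {name}");

        // Parse the Parquet content here with the reader library of your choice,
        // then upsert each record so that re-processing a file does not create duplicates.
        Container container = cosmosClient.GetContainer("mydb", "mycollection");
        await container.UpsertItemAsync(
            new { id = name, fileName = name },   // placeholder document shape
            new PartitionKey(name));              // assumes the container is partitioned on /id
    }
}

Because the trigger only fires when a blob is added or changed, files that have already been copied are not picked up again on each run.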

Related

Copy the last file added to a GCS bucket into Azure Blob Storage

I'm VERY new to Azure Data Factory, so pardon me if this is a stupid or obvious question.
I want to schedule a daily copy of the files stored in a GCS bucket into Azure Blob Storage. So far, I have managed to copy files (both manually and by scheduling the pipeline activity) from the GCS bucket, where I'm uploading the files manually.
In the near future the upload will happen automatically once a day at a given time, presumably during the night. My goal is to schedule the copy of just the last added file and avoid copying all the files every time, overwriting the existing ones.
Does this require writing some Python script? Is there some parameter to set?
Thank you all in advance for the replies.
There is no need for any explicit coding.
ADF supports a simple Copy activity to move data from GCS to Blob Storage, where GCS acts as the source and Blob Storage as the sink of the Copy activity.
https://learn.microsoft.com/en-us/azure/data-factory/connector-google-cloud-storage?tabs=data-factory
To get the latest file, you can use a Get Metadata activity to retrieve the list of files and then filter for the most recent one.

I have about 20 Excel/PDF files which can be downloaded from an HTTP server. I need to load these files into Azure Storage using Data Factory

I have 20 Excel/PDF files located on different HTTPS servers. I need to validate these files and load them into Azure Storage using Data Factory, then apply some business logic to the data and load it into an Azure SQL Database. I need to know whether I have to create a pipeline that stores the data in Azure Blob Storage first and then loads it into the Azure SQL Database.
I have tried creating a Copy Data activity in Data Factory.
My idea is as follows:
No. 1
Step 1: Use a Copy activity to transfer data from the HTTP connector source into a Blob Storage connector sink.
Step 2: Meanwhile, configure a Blob Storage trigger to execute your logic code so that the blob data is processed as soon as it lands in Blob Storage (see the function sketch after this comparison).
Step 3: Use a Copy activity to transfer data from the Blob Storage connector source into a SQL Database connector sink.
No. 2
Step 1: Use a Copy activity to transfer data from the HTTP connector source directly into a SQL Database connector sink.
Step 2: Meanwhile, configure a stored procedure in the sink to apply your logic steps; the data passes through it before being inserted into the table.
I think both methods are feasible. With No. 1, the business logic is freer and more flexible. No. 2 is more convenient, but it is limited by what you can express in a stored procedure. Pick whichever solution suits you.
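For step 2 of No. 1, a minimal sketch of such a blob-triggered function; the "raw" and "staged" container names are placeholders, and the pass-through copy stands in for your validation/business logic so that the second Copy activity can then read from "staged":

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ProcessIncomingFile
{
    [FunctionName("ProcessIncomingFile")]
    public static void Run(
        [BlobTrigger("raw/{name}", Connection = "AzureWebJobsStorage")] Stream input,
        [Blob("staged/{name}", FileAccess.Write, Connection = "AzureWebJobsStorage")] Stream output,
        string name,
        ILogger log)
    {
        log.LogInformation($"Applying business logic to {name}");

        // Replace this pass-through with your validation / transformation logic.
        input.CopyTo(output);
    }
}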
Excel and PDF are not supported yet. Based on the link, only certain formats are supported by ADF directly; when I tested reading one of the files as CSV, I got back random characters.
You could refer to this question for reading Excel files in ADF: How to read files with .xlsx and .xls extension in Azure data factory?

Filter blob data in Copy Activity

I have a Copy activity that copies data from Blob Storage to Azure Data Lake. The blob container is populated by an Azure Function with an Event Hub trigger. Blob file names are appended with a UNIX timestamp, which is the event's enqueued time in the Event Hub. Azure Data Factory is triggered every hour to merge the files and move them over to the Data Lake.
The source dataset offers filtering by Last Modified date (in UTC) out of the box. I can use this, but it limits me to the blob's Last Modified date. I want to use my own date filters and decide where to apply them. Is this possible in Data Factory? If yes, can you please point me in the right direction?
Staying within ADF, the only idea that comes to mind is a combination of the Lookup, ForEach and Filter activities, which may be somewhat complex:
1. Use a Lookup activity to retrieve the data from the blob file.
2. Use a ForEach activity to loop over the result and apply your own date/time filters.
3. Inside the ForEach activity, do the copy task.
Please refer to this blog to get some clues.
Reviewing everything you have described, I suggest you look into Azure Stream Analytics. Whether the data source is Event Hubs or Azure Blob Storage, ASA supports it as an input, and it supports Azure Data Lake as an output.
You could create a job that configures the input and output, then use the familiar SQL language to filter your data however you want, for example with the WHERE operator or the date/time functions.

Multiple file processing using ADF

I have created a pipeline with the following steps:
1. Copy files from Azure Blob Storage and save them in Azure Data Lake Store.
2. A U-SQL task then picks up those files and creates summary files in Azure Data Lake Store.
3. The next task picks up the data from those files and saves it in a database.
I am passing two parameters, windowStart and windowEnd, to give the date range. The issue is that it always processes only one day, and I am not sure what the problem is.
Note: initially I created the copy task with a tumbling window trigger, and it copied all files from Blob Storage to ADL Store, but once I added the new tasks and ran the pipeline manually, it processed only one file.
Thanks

Azure Data Factory: retrieve only new blob files from Blob Storage

I am currently copying blob files from Azure Blob Storage to an Azure SQL Database. The pipeline is scheduled to run every 15 minutes, but each time it runs it re-imports all the blob files. I would rather configure it so that it imports only files that have newly arrived in Blob Storage. One thing to note is that the files do not have a date-time stamp. All files are present in a single blob container, and new files are added to the same container. Do you know how to configure this?
I'd preface this answer by saying that a change in your approach may be warranted...
Given what you've described, you're fairly limited on options. One approach is to have your scheduled job maintain knowledge of what it has already stored in the SQL DB: loop over all the items in the container and check whether each one has been processed yet.
The container has a ListBlobs method that would work for this. Reference: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/
foreach (var item in container.ListBlobs(null, true))
{
    var blob = item as CloudBlockBlob;   // flat listing, so every item is a blob
    if (blob == null) continue;

    // Check whether blob.Name has already been processed (e.g. recorded in the
    // SQL database) and skip it if so; otherwise copy it and record it.
}
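One way to keep that state, as a rough sketch: record each processed blob name in a small tracking table in the target database and consult it inside the loop above. The dbo.ProcessedBlobs table and its single BlobName column are assumptions for this sketch.

using System.Data.SqlClient;

// Returns true if the blob name is already recorded in a hypothetical
// dbo.ProcessedBlobs tracking table in the target Azure SQL database.
static bool AlreadyProcessed(string connectionString, string blobName)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT COUNT(*) FROM dbo.ProcessedBlobs WHERE BlobName = @name", conn))
    {
        cmd.Parameters.AddWithValue("@name", blobName);
        conn.Open();
        return (int)cmd.ExecuteScalar() > 0;
    }
}

After a successful copy you would insert the blob name into the same table so the next run skips it.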
Note that the number of blobs in the container may be an issue with this approach. If it grows too large, consider creating a new container per hour/day/week/etc. to hold the blobs, assuming you can control this.
Please use CloudBlobContainer.ListBlobs(null, true, BlobListingDetails.Metadata) and check CloudBlob.Properties.LastModified for each listed blob.
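A minimal sketch of that call, reusing the container object from the earlier snippet; the lastRunUtc cutoff is an assumed value you would persist between runs:

// Cutoff from the previous run; in practice, load the value you persisted last time.
DateTimeOffset lastRunUtc = DateTimeOffset.UtcNow.AddMinutes(-15);

foreach (var item in container.ListBlobs(null, true, BlobListingDetails.Metadata))
{
    if (item is CloudBlockBlob blob &&
        blob.Properties.LastModified.HasValue &&
        blob.Properties.LastModified.Value > lastRunUtc)
    {
        // Only blobs modified since the last run reach this point; copy them to SQL here.
    }
}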
Instead of a Copy activity, I would use a custom .NET activity within Azure Data Factory and use the Blob Storage API (some of the answers here describe its use) and the Azure SQL API to copy only the new files.
However, over time your blob location will accumulate a lot of files, so expect the job to take longer and longer (at some point longer than 15 minutes), as it would iterate through every file each time.
Can you explain your scenario further? Is there a reason you want to add data to the SQL tables every 15 minutes? Could you increase that to copy data every hour? Also, how is this data getting into Blob Storage? Is another Azure service putting it there, or is it an external application? If it is another service, consider moving the data straight into Azure SQL and cutting out Blob Storage.
Another suggestion would be to create folders for the 15-minute intervals, named hhmm. So, for example, a sample folder would be called '0515'. You could even have parent folders for the year, month and day. This way you can insert the data into these folders in Blob Storage. Data Factory is capable of reading date and time folders and identifying new files that come into them.
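A rough sketch of the writer side under that convention, using the same storage SDK as the earlier snippets; the file name and local path are placeholders:

// Place each file under year/month/day/hhmm, rounded down to the current
// 15-minute slot, e.g. ".../2016/05/12/0515/data.csv".
DateTime now = DateTime.UtcNow;
DateTime slot = now.AddMinutes(-(now.Minute % 15));
string path = slot.ToString("yyyy'/'MM'/'dd'/'HHmm") + "/data.csv";

CloudBlockBlob blob = container.GetBlockBlobReference(path);
await blob.UploadFromFileAsync(@"C:\incoming\data.csv");

The Data Factory dataset can then point at a time-partitioned folder path so each run reads only its own slot.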
I hope this helps! If you can provide some more information about your problem, I'd be happy to help you further.
