Azure Data Factory: Only Retrieve New Blob Files from Blob Storage

I am currently copying blob files from an Azure Blob storage to an Azure SQL Database. The copy is scheduled to run every 15 minutes, but each run re-imports all of the blob files. I would like to configure it so that it only imports files that have newly arrived in the Blob storage. One thing to note is that the files do not have a date-time stamp; all files sit in a single blob container, and new files are added to that same container. Do you know how to configure this?

I'd preface this answer by saying that a change in your approach may be warranted...
Given what you've described, you're fairly limited on options. One approach is to have your scheduled job maintain knowledge of what it has already stored in the SQL database: you loop over all the items within the container and check whether each one has been processed yet.
The container has a ListBlobs method that would work for this. Reference: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/
foreach (var item in container.ListBlobs(null, true))
{
    var blob = item as CloudBlockBlob;   // a flat listing returns IListBlobItem entries
    if (blob == null) continue;
    // Check whether blob.Name has already been processed or not
}
Note that the number of blobs in the container may be an issue with this approach. If it is too large consider creating a new container per hour/day/week/etc to hold the blobs, assuming you can control this.
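Putting the pieces together, a minimal sketch of that bookkeeping approach; LoadProcessedNames, ImportBlob and MarkAsProcessed are hypothetical helpers standing in for your own SQL-backed logic, not SDK calls:

// Names of blobs already imported, loaded from the SQL database (hypothetical helper).
HashSet<string> processed = LoadProcessedNames();

foreach (var item in container.ListBlobs(null, true))
{
    var blob = item as CloudBlockBlob;
    if (blob == null || processed.Contains(blob.Name)) continue;

    ImportBlob(blob);            // copy the blob's contents into Azure SQL (hypothetical helper)
    MarkAsProcessed(blob.Name);  // record the name so the next run skips it (hypothetical helper)
}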

Please use CloudBlobContainer.ListBlobs(null, true, BlobListingDetails.Metadata) and check CloudBlob.Properties.LastModified for each listed blob.
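For example, a sketch that keeps only the blobs modified since the previous run; LoadLastRunTime is a hypothetical helper for wherever the job persists its state:

// Only pick up blobs that changed after the previous run.
DateTimeOffset lastRunTime = LoadLastRunTime(); // hypothetical helper

foreach (var item in container.ListBlobs(null, true, BlobListingDetails.Metadata))
{
    var blob = item as CloudBlob;
    if (blob == null) continue;

    if (blob.Properties.LastModified > lastRunTime)
    {
        // New or updated since the last run: import this one.
    }
}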

Instead of a copy activity, I would use a custom DotNet activity within Azure Data Factory and use the Blob Storage API (some of the answers here have described the use of this API) and Azure SQL API to perform your copy of only the new files.
However, over time your blob location will accumulate a lot of files, so expect your job to take longer and longer (at some point longer than 15 minutes), since it would iterate through every file on each run.
Can you explain your scenario further? Is there a reason you want to add data to the SQL tables every 15 minutes? Could you increase that to copy data every hour? Also, how is this data getting into Blob Storage? Is another Azure service putting it there, or is it an external application? If it is another service, consider moving the data straight into Azure SQL and cutting out the Blob Storage.
Another suggestion would be to create folders for the 15-minute intervals, named hhmm. So, for example, a sample folder would be called '0515'. You could even have parent folders for the year, month and day. This way you can insert the data into these folders in Blob Storage. Data Factory is capable of reading date and time folders and identifying new files that arrive in them.
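For illustration, a small sketch of how the uploading side could compute such a path; the yyyy/MM/dd/hhmm layout and the file name are assumptions for the example, not something Data Factory mandates:

// Round the current time down to the nearest 15-minute slice, e.g. 05:23 -> "0515".
DateTime now = DateTime.UtcNow;
DateTime slice = new DateTime(now.Year, now.Month, now.Day, now.Hour, (now.Minute / 15) * 15, 0);

// e.g. "2016/11/28/0515/myfile.csv"
string path = $"{slice:yyyy}/{slice:MM}/{slice:dd}/{slice:HHmm}/myfile.csv";

CloudBlockBlob blob = container.GetBlockBlobReference(path);
blob.UploadFromFile(@"C:\data\myfile.csv");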
I hope this helps! If you can provide some more information about your problem, I'd be happy to help you further.

Related

How to execute a trigger based on Blob created in Azure Data Factory?

I have a pipeline that executes with a trigger every time a blob is created. Sometimes the process needs to handle many files at once, so I created a 'For Each' activity in my pipeline as follows, in order to load the data when multiple blobs are created:
That part of the pipeline uploads the data of every blob in the container to a SQL database, and here is the problem: when I execute it manually everything is fine, but when the trigger fires, it executes as many times as there are blobs in the container and loads the data multiple times no matter what (the trigger configuration is below).
What am I doing wrong? Is there any way to execute the pipeline just once when a blob is created, no matter how many files are in the container?
Thanks by the way, best regards.
Your solution triggers on a storage event. So that part is working.
When triggered, it retrieves all files in the container and processes every blob in that container. Not working as intended.
I think you have a few options here. You may want to follow this MSFT tutorial, where they use a single copy activity to a sink. Step 11 shows how to pass @triggerBody().folderPath and @triggerBody().fileName to the copy activity.
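Roughly, the storage event trigger exposes the location of the blob that fired it, and you map those values into pipeline parameters (the parameter names here are illustrative):

sourceFolder: @triggerBody().folderPath
sourceFile:   @triggerBody().fileName

That way each run copies only the single blob that raised the event, instead of processing the whole container.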
The other option is to aggregate all the blob storage events and use a batch process to do the operation.
I would try the simple one-to-one processing option first.

Triggered copy data from blob to ADLS extracting path from filename

I am trying to centralize our data into an ADLSgen2 data lake. One of our datasets is 'dumped' into a blob storage, and I want to have a triggered copy.
The files stored in the blob storage have a date as their filename (it can be an arbitrary date) and are in JSON format. What I want is for new files to be (binary) copied to a folder on the data lake, with a path built from pieces of the date present in the filename.
2020-01-01.json → raw/blob/2020/01/raw_reports_blob_2020-01-01.json
First I tried a copy data job and a pipeline in Azure Synapse, but I am not sure how to set the sink path with details from the source filename. It seems that the copy-data tool cannot be triggered by new blob files. The pipeline method looks pretty powerful, and I guess it is possible. What I want is not that difficult on Linux, so I guess it must be possible in Azure as well.
Second, I tried to create an Azure Function, as I am pretty comfortable with Python; however, here I have a similar problem, as I need to define in/out bindings. The output bindings are defined at design time and do not give me the freedom to build the path from the source filename. It also feels somewhat like overkill for a simple binary copy action. I can have the function triggered by new files in the blob storage, and reading them is no problem.
I am relatively new to Azure and any help towards a solution is more than welcome.
See this answer as well: https://stackoverflow.com/a/66393471/496289
There is no concept of "copy" per se in ADLS: you read/download from the source and write/upload to the target.
As someone mentioned Data Factory can do this.
You can also use:
azcopy from a PowerShell Azure Function. azcopy cp "https://[srcaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"
A Python/Java/... Azure Function. You'll have to download the file (in chunks if it's big) and upload it (again in chunks if it's big); see the sketch after this list.
Databricks. This would be a similar misuse of a tool, like using Azure Synapse Analytics to copy data between storage accounts.
Azure Logic Apps. See this and this. I've never used them, but I believe they are less code than an Azure Function and have some programming capabilities, if that helps you create the destination path programmatically.
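To illustrate the Azure Function route, a minimal C# sketch that builds the destination path from the date in the filename and streams the bytes across; the container references and naming pattern follow the example above, and the classic WindowsAzure.Storage SDK is an assumption:

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

// Copy e.g. "2020-01-01.json" to "raw/blob/2020/01/raw_reports_blob_2020-01-01.json".
static async Task CopyToLakeAsync(CloudBlobContainer source, CloudBlobContainer lake, string fileName)
{
    // The filename is a date; take its year and month for the destination path.
    DateTime date = DateTime.Parse(Path.GetFileNameWithoutExtension(fileName));
    string destPath = $"raw/blob/{date:yyyy}/{date:MM}/raw_reports_blob_{fileName}";

    CloudBlockBlob src = source.GetBlockBlobReference(fileName);
    CloudBlockBlob dest = lake.GetBlockBlobReference(destPath);

    // A binary copy: stream down from the source, up to the data lake.
    using (Stream stream = await src.OpenReadAsync())
    {
        await dest.UploadFromStreamAsync(stream);
    }
}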
Things to remember:
Data Factory can be expensive, especially compared to Azure Functions on a consumption plan.
Azure Functions on a consumption plan have a 10-minute maximum before they time out, so you can't use them if you have files in the GBs/TBs.
You'll be paying egress costs if applicable.

How to skip already copied files in Azure data factory, copy data tool?

I want to copy data from blob storage (parquet format) to Cosmos DB. I scheduled the trigger to run every hour, but all of the files/data get copied on every run. How do I skip the files that have already been copied?
There is no unique key in the data, and we should not copy the same file content again.
Based on your requirements, you could look at the modifiedDatetimeStart and modifiedDatetimeEnd properties in the Blob Storage dataset properties.
But you would need to modify the dataset configuration every period via the SDK to push the property values forward.
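For reference, those properties take UTC timestamps in the dataset's typeProperties; a rough sketch of one hourly window (the values shown are placeholders):

"typeProperties": {
    "modifiedDatetimeStart": "2021-01-01T00:00:00Z",
    "modifiedDatetimeEnd": "2021-01-01T01:00:00Z"
}

Each run would then have to move both values forward by one hour, which is the periodic modification mentioned above.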
Another two solutions you could consider:
1. Using a Blob Trigger Azure Function. It is triggered whenever a blob file is added or modified, and you could then transfer the data from blob to Cosmos DB with SDK code (a skeleton follows below).
2. Using Azure Stream Analytics. You could configure the input as Blob Storage and the output as Cosmos DB.
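A minimal skeleton of option 1; the container name is illustrative, and the Cosmos DB write is left as a comment since it depends on your document shape:

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BlobToCosmos
{
    // Fires once per new or updated blob, so each file is handled exactly once.
    [FunctionName("BlobToCosmos")]
    public static void Run(
        [BlobTrigger("input-container/{name}")] Stream blob,
        string name,
        ILogger log)
    {
        log.LogInformation($"Processing {name} ({blob.Length} bytes)");
        // Parse the parquet content and upsert the rows into Cosmos DB here,
        // e.g. with the Microsoft.Azure.Cosmos SDK.
    }
}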

How to archive Azure blob storage content?

I need to store some temporary files, maybe for 1 to 3 months, and only need to keep the last three months of files; older files need to be deleted. How can I do this in Azure blob storage? Is there any other option in this case besides blob storage?
IMHO the best options for storing files in Azure are either Blob Storage or File Storage; however, neither of them supports auto-expiration of content (based on age or some other criteria).
This feature was requested long back for Blob Storage, but unfortunately no progress has been made so far (https://feedback.azure.com/forums/217298-storage/suggestions/7010724-support-expiration-auto-deletion-of-blobs).
You could however write something of your own to achieve this, and it's really quite simple: periodically (say once a day) your program fetches the list of blobs and compares each blob's last-modified date with the current date. If a blob is older than the desired period (1 or 3 months, as you mentioned), you simply delete it.
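That check is only a few lines with the storage SDK; a sketch, assuming a 3-month retention window:

// Delete every blob whose last-modified date falls outside the retention window.
DateTimeOffset cutoff = DateTimeOffset.UtcNow.AddMonths(-3);

foreach (var item in container.ListBlobs(null, true))
{
    var blob = item as CloudBlob;
    if (blob == null) continue;

    if (blob.Properties.LastModified < cutoff)
    {
        blob.DeleteIfExists(); // older than three months: remove it
    }
}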
You can use WebJobs, Azure Functions or Azure Automation to schedule your code to run on a periodic basis. In fact, there's readymade code available to you if you want to use Azure Automation Service: https://gallery.technet.microsoft.com/scriptcenter/Remove-Storage-Blobs-that-aae4b761.
As far as I know, Azure Blob storage is an appropriate approach for storing temporary files. For your scenario, I assume there is no built-in option to delete the old files, and you need to delete your temporary files programmatically or manually.
For a simple way, you could try uploading your blobs (files) with a specific path format (e.g. https://<your-storagename>.blob.core.windows.net/containerName/2016-11/fileName or https://<your-storagename>.blob.core.windows.net/2016-11/fileName); then you could manage your files manually via Microsoft Azure Storage Explorer.
Also, you could check your files and delete the old ones before uploading a new temporary file. For more details, you could follow storage-blob-dotnet-store-temp-files and override the method CleanStorageIfReachLimit to implement your logic for deleting blobs (files).
Additionally, you could leverage a scheduled Azure WebJob to clean your blobs(files).
You can use Azure Cool Blob Storage.
It is cheaper than Blob storage and is more suitable for archives.
You can store your less frequently accessed data in the Cool access tier at a low storage cost (as low as $0.01 per GB in some regions), and your more frequently accessed data in the Hot access tier at a lower access cost.
Here is a document that explains its features:
https://azure.microsoft.com/en-us/blog/introducing-azure-cool-storage/

Fast mechanism for querying Azure blob names

I'm trying to get a list of blob names in Azure and I'm looking for ways to make this operation significantly faster. Within a given sub-folder, the number of blobs can exceed 150,000 elements. The filenames of the blobs are an encoded ID which is what I really need to get at, but I could store that as some sort of metadata if there was a way to query just the metadata or a single field of the metadata.
I'm finding that something as simple as the following:
var blobList = container.ListBlobs(null, false);
can take upwards of 60 seconds to run from my desktop and typically around 15 seconds when running on a VM hosted in Azure. These times are based on a test of 125k blobs in an otherwise empty container and were several hours after they were uploaded, so they've definitely had time to "settle", so to speak.
I've attempted multiple variations and tried using ListBlobsSegmented, but it doesn't really help, because the function returns a lot of extra information that I simply don't need. I just need the blob names so I can get at the encoded ID to see what's currently stored and what isn't.
The query for the blob names and extracting the encoded Id is somewhat time sensitive so if I could get it to under 1 second, I'd be happy with it. If I stored the files locally, I can get the entire list of files in a few ms, but I have to use Azure storage for this so that's not an option.
The only thing I can think of to reduce the time it takes to identify the available blobs is to track the names of the blobs being added to or removed from a given folder and store that list in a separate blob. Then, when I need to know the blob names in that folder, I would read the blob with the metadata rather than using ListBlobs. I suppose another option would be to use Azure Table storage in a similar way, but either way it seems like I'm being forced into caching information about a given folder in the container.
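A minimal sketch of that bookkeeping idea with an append blob as the name index; the index blob's name is illustrative, and AppendText assumes a single writer (concurrent uploaders would need AppendBlock or the Table-storage variant):

// Writer side: record each new blob's name in one index blob at upload time.
CloudAppendBlob index = container.GetAppendBlobReference("blob-name-index.txt");
if (!index.Exists())
{
    index.CreateOrReplace();
}
index.AppendText(newBlobName + "\n"); // the encoded-ID filename just uploaded (illustrative variable)

// Reader side: one small download instead of listing 150k+ blobs.
string[] names = index.DownloadText().Split('\n');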
Is there a better way of doing this or is this generally what people end up doing when you have hundreds of thousands of blobs in a single folder?
As mentioned, Azure Blob storage is a storage system and doesn't help you with indexing the content. There is now the Azure Search indexer, which indexes content uploaded to Azure Blob storage; see https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/. With it you can use all the features supported by Azure Search, e.g. listing, searching, paging, sorting, etc. Hope this helps.
