Triggered copy data from blob to ADLS extracting path from filename - azure

I am trying to centralize our data to a ADLSgen2 data lake. One of our datasets is 'dumped' in a blob storage and I want to have a triggered copy.
The files that are stored in the blob storage have a data as a filename (can be a arbitrary date) in JSON format. What I want is that new files are (binary) copied to a folder on the data lake with path using pieces of the date that are present in the filename.
2020-01-01.json → raw/blob/2020/01/raw_reports_blob_2020-01-01.json
First I tried a data copy job and a Pipeline in Azure Synapse but I am not sure how to set the sink path with details from source filename. It seems that the copy-data-tool cannot be triggered by new blob files. The pipeline method looks pretty powerful and I guess it is possible. What I want is not that difficult on Linux so I guess it must be possible in Azure as well.
Second, I tried to create an Azure Function as I am pretty comfortable with Python, however here I have a similar problem as I need to define in/out bindings. The out bindings are defined at design time and do not give me the freedom to the kind of path based on the source filename. Also, it feels somewhat overkill for a simple binary copy action. I can have the function triggered with new files in the blob and reading them is no problem.
I am relatively new to Azure and any help towards a solution is more than welcome.

See this answer as well: https://stackoverflow.com/a/66393471/496289
There is concept of "copy" per sē in ADLS. You read/download from source and write/upload to target.
As someone mentioned Data Factory can do this.
You can also use:
azcopy from a Power Shell Azure Function. azcopy cp "https://[srcaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"
Python/Java/... Azure Function. You'll have to download the file (in chunks if it's big) and upload it (in chunks if big).
Databricks. This would be similar misuse of a tool as using Azure Synapse Analytics to copy data between storage accounts.
Azure Logic apps. See this and this. Never used them, but I believe they are less code than Azure Function and have some programming capabilities, if it helps you create destination path programmatically.
Things to remember:
Data Factory, can be expensive. Especially compared to Azure Functions on consumption plan.
Azure Functions on consumption plan have 10 minute max before they timeout. So can't use it if you have files in GBs/TBs.
You'll be paying egress costs if applicable.

Related

Staging or landing on Azure

I am performing ETL in Azure Data Factory and I just wanted to confirm my understanding of it before going further. Please find the image attached below.
I am collecting data from multiple source and storing in Azure Blob Storage then perform Transformation and Loading. What I am confused about is that whether Azure Blob Storage is a landing or staging area here in my case. Some people use these terms interchangeably and couldn't understand the fine line between these two terms.
Also, can anyone explain me which part is Extract, Transform and Load is. In my understating, collecting the data from multiple source and store into Azure Blob Storage is Extracting, Azure Data Factory is Transformation and copying the transformed data into Azure Database is Loading. Am i correct or is there something I am misunderstanding here?
What I am confused about is that whether Azure Blob Storage is a
landing or staging area here in my case.
In your case, Azure Blob Storage is both landing area and staging area. Landing area means a area collecting data from different places. Staing area means it only save data for a little time, staging data should be deleted during ETL process.
Also, can anyone explain me which part is Extract, Transform and Load
is.
Copy Activity is a typical technology based on ETL. If only talking about the Copy Activity of Azure Data Factory, after you specify the copy source, the ADF will perform copy activities based on this, this is 'extract'. The part of the ADF that transfers data to the specified Sink according to your settings, this is 'Load', and the details of the copy behavior is 'Transform'. If you look at your entire process, you collect data to blob storage is also 'Extract'.

How to handle Incremental & Full Upload in a Azure Data Factory

We have a Azure Storage Account with 2 blob stores. A Full and a Inc.
In the Full we place the full upload CSV files whenever a Full Upload is needed, in the Inc we just place day by day small incremental CSV Files.
We load all our data first in a staging, then to the ODS en finally to a Edw (Enterprise DW).
A full upload is only needed when there are structural changes to the tables.
Basically the only difference between the two uploads is that the full also cleares all data in the ODS and the EDW, but runs the sames stored procedures in the pipelines, ...
Anybody has tips on how to handle such a situation in a Azure Data Factory.
I would prefer not to double the data-factories, but due to the different avalability/frequency of the output datasets I can't use the same staging logical (in the data-factory) table as output dataset ....
So any hint(s) are appreciated ...
First of all to be clear ADF is just there to invoke other Azure services, it doesn't do any of the work itself. So the question really is; what services in Azure could you call from ADF to do this work and manage this situation?
To answer that...
Option 1: I would suggest you look at Azure Data Lake. I've written simply procedures to what you've described above in USQL where parameters can be passed to the USQL procedures from ADF for different types of behaviour.
The code you create can live in an Azure Data Lake Analytics database, similar to TSQL objects. Then maybe start using Azure Data Lake Storage as well, instead of normal blobs.
Option 2: Break out the C# and create yourself an Azure Data Factory custom activity and create a set of classes to do exactly what you require. Again with params passed by ADF or include logic in the methods to check the 'full' table contents. This will however involve a lot more development work and require an Azure Batch Service for the compute.

How to archive Azure blob storage content?

I'm need to store some temporary files may be 1 to 3 months. Only need to keep the last three months files. Old files need to be deleted. How can I do this in azure blob storage? Is there any other option in this case other than blob storage?
IMHO best option to store files in Azure is either Blob Storage or File Storage however both of them don't support auto expiration of content (based on age or some other criteria).
This feature has been requested long back for Blobs Storage but unfortunately no progress has been made so far (https://feedback.azure.com/forums/217298-storage/suggestions/7010724-support-expiration-auto-deletion-of-blobs).
You could however write something of your own to achieve this. It's rather very simple: Periodically (say once in a day) your program will fetch the list of blobs and compare the last modified date of the blob with current date. If the last modified date of the blob is older than the desired period (1 or 3 months like you mentioned), you simply delete the blob.
You can use WebJobs, Azure Functions or Azure Automation to schedule your code to run on a periodic basis. In fact, there's readymade code available to you if you want to use Azure Automation Service: https://gallery.technet.microsoft.com/scriptcenter/Remove-Storage-Blobs-that-aae4b761.
As I know, Azure Blob is a appropriate approach for you to storage some temporary files. For your scenario, I assumed that there is no in-build option for you to delete the old files, and you need to programmatically or manually delete your temporary files.
For a simple way, you could try to upload your blob(file) with the specific format (e.g. https://<your-storagename>.blob.core.windows.net/containerName/2016-11/fileName or https://<your-storagename>.blob.core.windows.net/2016-11/fileName), then you could manually manage your files via Microsoft Azure Storage Explorer.
Also, you could check your files and delete the old files before you uploading the new temporary file. For more details, you could follow storage-blob-dotnet-store-temp-files and override the method CleanStorageIfReachLimit to implement your logic for deleting blobs(files).
Additionally, you could leverage a scheduled Azure WebJob to clean your blobs(files).
You can use Azure Cool Blob Storage.
It is cheaper than Blob storage and is more suitable for archives.
You can store your less frequently accessed data in the Cool access tier at a low storage cost (as low as $0.01 per GB in some regions), and your more frequently accessed data in the Hot access tier at a lower access cost.
Here is a document that explains its features:
https://azure.microsoft.com/en-us/blog/introducing-azure-cool-storage/

How to clone blob container and contents

Summary
I have picked up support for a fairly old website which stores a bunch of blobs in Azure. What I would like to do is duplicate all of my blobs from live to the test environment so I can use them without affecting users.
Architecture
The website is a mix of VB webforms and MVC, communicating with an Azure blob service (e.g. https://x.blob.core.windows.net/LiveBlobs).
The test site mirrors the live setup, except it points to a different blob container in the same storage account (e.g. https://x.blob.core.windows.net/TestBlobs)
Questions
Can I copy all of the blobs from live to test without downloading
them? They would need to maintain the same names.
How do I work out what it will cost to do this? The live blob
storage is roughly 130GB, but it should just be copying the data within the same data centre right?
Things I've investigated
I've spent quite some time searching for an answer, but what I've found deals with copying between storage accounts or copying single blobs.
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
I am fairly new to Azure so please forgive me if this is a silly question or I've missed out some important details. I'm more than happy to add any extra information should you need it.
Can I copy all of the blobs from live to test without downloading
them? They would need to maintain the same names.
Yes, you can. Copying blob is an asynchronous server-side operation. You simply tell the blob service the blobs to copy & destination details and it will do the job for you. No need to download first and upload them to destination.
How do I work out what it will cost to do this? The live blob storage
is roughly 130GB, but it should just be copying the data within the
same data centre right?
So there are 3 things you need to consider when it comes to costing: 1) Storage costs, 2) transaction costs and 3) data egress costs.
Since the copied blobs will be stored somewhere, they will be consuming storage and you will incur storage costs.
Copy operation will perform some read operations on source blobs and then write operation on destination blobs (to create them), so you will have to incur transaction costs. At very minimum for each blob copy, you can expect 2 transactions - read on source and write on destination (though there can be more transactions).
You incur data egress costs if the destination storage account is not in the same region as your source storage account. As long as both storage accounts are in the same region, you would not incur this.
You can use Azure Storage Pricing Calculator to get an idea about how much it is going to cost you.
I've also found AzCopy which looks promising but it looks like it
would copy the files one by one so I'm worried it would end up taking
a long time and costing a lot.
Blobs are always copied one-by-one. Copying across storage accounts is always async server side operation so you can't really predict how much time it would take for the copy operation to complete but in my experience it is quite fast. If you want to control when the blobs are copied, you would need to download them first and upload them. AzCopy supports this mode as well.
As far as costs are concerned, I think it is a relative term when you say it is going to cost a lot. But in general Azure Storage is very cheap and 130 GB is not a whole lot of data.

Azure Data Factory Only Retrieve New Blob files from Blob Storage

I am currently copying blob files from an Azure Blob storage to an Azure SQL Database. It is scheduled to run every 15 minutes but each time it runs it repeatedly imports all blob files. I would rather like to configure it so that it only imports if any new files have arrived into the Blob storage. One thing to note is that the files do not have a date time stamp. All files are present in a single blob container. New files are added to the same blob container. Do you know how to configure this?
I'd preface this answer with a change in your approach may be warranted...
Given what you've described your fairly limited on options. One approach is to have your scheduled job maintain knowledge of what it has already stored into the SQL db. You loop over all the items within the container and check if it has been processed yet.
The container has a ListBlobs method that would work for this. Reference: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/
foreach (var item in container.ListBlobs(null, true))
{
// Check if it has already been processed or not
}
Note that the number of blobs in the container may be an issue with this approach. If it is too large consider creating a new container per hour/day/week/etc to hold the blobs, assuming you can control this.
Please use CloudBlobContainer.ListBlobs(null, true, BlobListingDetails.Metadata) and check CloudBlob.Properties.LastModified for each listed blob.
Instead of a copy activity, I would use a custom DotNet activity within Azure Data Factory and use the Blob Storage API (some of the answers here have described the use of this API) and Azure SQL API to perform your copy of only the new files.
However, with time, your blob location will have a lot of files, so, expect that your job will start taking longer and longer (after a point taking longer than 15 minutes) as it would iterate through each file every time.
Can you explain your scenario further? Is there a reason you want to add data to the SQL tables every 15 minutes? Can you increase that to copy data every hour? Also, how is this data getting into Blob Storage? Is another Azure service putting it there or is it an external application? If it is another service, consider moving it straight into Azure SQL and cut out the Blob Storage.
Another suggestion would be to create folders for the 15 minute intervals like hhmm. So, for example, a sample folder would be called '0515'. You could even have a parent folder for the year, month and day. This way you can insert the data into these folders in Blob Storage. Data Factory is capable of reading date and time folders and identifying new files that come into the date/time folders.
I hope this helps! If you can provide some more information about your problem, I'd be happy to help you further.

Resources