Copy the last file added to a GCS bucket into Azure Blob Storage

I'm VERY new to Azure Data Factory, so pardon me if this is a stupid or obvious question.
I want to schedule a copy of the files stored in a GCS bucket into Azure Blob Storage once a day. So far I have managed to copy files (both manually and by scheduling the pipeline's copy activity) from the GCS bucket, where I'm currently uploading the files by hand.
In the near future, the upload will happen automatically once a day at a given time,
presumably during the night. My goal is to schedule the copy of just the last added file and avoid copying all the files every time,
overwriting the existing ones.
Is this something that requires writing a Python script? Is there some parameter to set?
Thank you all in advance for the replies.

There is no need for any explicit coding.
ADF supports a simple Copy activity to move data from GCS to Blob Storage, where GCS acts as the source and Blob Storage acts as the sink of the Copy activity.
https://learn.microsoft.com/en-us/azure/data-factory/connector-google-cloud-storage?tabs=data-factory
To get only the latest file, you can use the Get Metadata activity to list the files and then filter for the most recent one.
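If you would rather script it yourself (you mentioned Python), here is a minimal sketch of the same idea using the google-cloud-storage and azure-storage-blob packages. The bucket name, container name and connection string are placeholders, so treat this as an illustration rather than a drop-in solution:

# Minimal sketch: copy only the most recently modified object from a GCS bucket
# into an Azure Blob Storage container. Names and credentials are placeholders.
from google.cloud import storage
from azure.storage.blob import BlobServiceClient

gcs = storage.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS

# Pick the object in the bucket with the newest 'updated' timestamp
latest = max(gcs.list_blobs("my-gcs-bucket"), key=lambda b: b.updated)

blob_service = BlobServiceClient.from_connection_string("<azure-storage-connection-string>")
container = blob_service.get_container_client("my-container")  # placeholder container name

# Upload under the same name, overwriting any existing blob with that name
container.upload_blob(name=latest.name, data=latest.download_as_bytes(), overwrite=True)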

Related

Delete a file from a blob container in a Data Share connection

o/
I have an Azure Data Share connection through which I receive data in Parquet format every day into my blob container. If a file is deleted in the source blob storage, would it be deleted on my side too?
I'm asking because the company that sends the data deleted all the data and created new files with other names, so in this case I would keep all the historical data plus the new data, right?
The connection is set up as incremental.
When file systems, containers, or folders are shared in snapshot-based sharing, data consumers can choose to make a full copy of the shared data, or they can use the incremental snapshot capability to copy only new or updated files. The incremental snapshot capability is based on the last modified time of the files.
Existing files that have the same name are overwritten during a snapshot. A file that is deleted from the source isn't deleted on the target. Empty subfolders at the source aren't copied over to the target.
For more details, refer to Share and receive data from Azure Blob Storage and Azure Data Lake Storage.

How to skip already copied files in the Azure Data Factory Copy Data tool?

I want to copy data from Blob Storage (Parquet format) to Cosmos DB. I scheduled the trigger to run every hour, but all the files/data get copied on every run. How can I skip the files that have already been copied?
There is no unique key in the data, and we should not copy the same file content again.
Based on your requirements, you could look at the modifiedDatetimeStart and modifiedDatetimeEnd properties in the Blob Storage dataset properties.
But you would need to update the dataset configuration periodically via the SDK to move the values of those properties forward.
Two other solutions you could consider:
1. Use a blob-triggered Azure Function. It is triggered whenever a blob file is added or modified, and you can then transfer the data from Blob Storage to Cosmos DB with SDK code. A rough sketch follows this list.
2. Use Azure Stream Analytics. You can configure Blob Storage as the input and Cosmos DB as the output.
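For illustration only, a blob-triggered Azure Function along the lines of option 1 might look roughly like the sketch below (Python v1 programming model; the blob trigger binding itself is declared in function.json, and the account endpoint, key, database and container names are placeholders). For simplicity it assumes the blobs contain newline-delimited JSON rather than Parquet; for Parquet you would first read the blob with pyarrow or pandas.

# Sketch of a blob-triggered Azure Function that upserts each new blob's records
# into Cosmos DB. Endpoint, key, database and container names are placeholders.
import json
import azure.functions as func
from azure.cosmos import CosmosClient

cosmos = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
target = cosmos.get_database_client("mydb").get_container_client("mycontainer")

def main(myblob: func.InputStream):
    # Runs once per new or updated blob, so files that were already copied are
    # never re-read. Assumes one JSON document per line, each already carrying
    # its own id and the container's partition key.
    for line in myblob.read().decode("utf-8").splitlines():
        target.upsert_item(json.loads(line))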

Blob Store to Blob Store

I'm currently working on a project for one of our managed services clients.
We're looking to take data out of their blob store (A) and move it into another blob store (B) using AzCopy.
My question is: will blob store (B) update from blob store (A) when new data arrives, or will we have to do a full copy each time we want to move new data across?
It seems like a silly question, however I couldn't find the answer online.
Thanks in advance!
AzCopy is just a command-line tool that will allow you to copy blob x or container y from storage account A to storage account B; it's not doing anything special. If a blob already exists it will give you the option to skip it or overwrite it, like any normal copy operation.
What it copies comes down to the script you are running that triggers AzCopy - what are you telling it to copy?
You also might want to look at Azure Data Factory for doing blob to blob copies.
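As a rough illustration (the container URLs and SAS tokens are placeholders, so check the flags against your AzCopy v10 version), a plain copy that only overwrites when the source is newer, versus a sync that transfers only new or changed blobs, would look something like:

azcopy copy "https://accounta.blob.core.windows.net/data?<SAS>" "https://accountb.blob.core.windows.net/data?<SAS>" --recursive --overwrite=ifSourceNewer
azcopy sync "https://accounta.blob.core.windows.net/data?<SAS>" "https://accountb.blob.core.windows.net/data?<SAS>" --recursive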

How to archive Azure blob storage content?

I need to store some temporary files for maybe 1 to 3 months. I only need to keep the last three months of files; older files need to be deleted. How can I do this in Azure Blob Storage? Is there any other option in this case besides Blob Storage?
IMHO the best options for storing files in Azure are Blob Storage and File Storage, however neither of them supports automatic expiration of content (based on age or some other criteria).
This feature was requested for Blob Storage long ago, but unfortunately no progress has been made so far (https://feedback.azure.com/forums/217298-storage/suggestions/7010724-support-expiration-auto-deletion-of-blobs).
You could, however, write something of your own to achieve this. It's quite simple: periodically (say once a day) your program fetches the list of blobs and compares each blob's last modified date with the current date. If the last modified date is older than the desired retention period (1 or 3 months, as you mentioned), you simply delete the blob.
You can use WebJobs, Azure Functions or Azure Automation to schedule your code to run on a periodic basis. In fact, there's ready-made code available if you want to use Azure Automation: https://gallery.technet.microsoft.com/scriptcenter/Remove-Storage-Blobs-that-aae4b761.
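For example, a minimal sketch of that cleanup with the azure-storage-blob Python package could look like the following (the connection string, container name and 90-day window are placeholders/assumptions):

# Delete blobs whose last modified time is older than the retention window.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

RETENTION = timedelta(days=90)  # roughly the last three months

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("temp-files")  # placeholder container name

cutoff = datetime.now(timezone.utc) - RETENTION
for blob in container.list_blobs():
    if blob.last_modified < cutoff:
        container.delete_blob(blob.name)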
As far as I know, Azure Blob Storage is an appropriate approach for storing temporary files. For your scenario, I assume there is no built-in option to delete the old files, so you need to delete your temporary files programmatically or manually.
As a simple approach, you could upload your blobs (files) with a specific naming format (e.g. https://<your-storagename>.blob.core.windows.net/containerName/2016-11/fileName or https://<your-storagename>.blob.core.windows.net/2016-11/fileName), then manage your files manually via Microsoft Azure Storage Explorer.
Also, you could check your files and delete the old ones before uploading a new temporary file. For more details, you could follow storage-blob-dotnet-store-temp-files and override the method CleanStorageIfReachLimit to implement your logic for deleting blobs (files).
Additionally, you could use a scheduled Azure WebJob to clean up your blobs (files).
You can use Azure Cool Blob Storage.
It is cheaper than the Hot tier of Blob Storage and is more suitable for archives.
You can store your less frequently accessed data in the Cool access tier at a low storage cost (as low as $0.01 per GB in some regions), and your more frequently accessed data in the Hot access tier at a lower access cost.
Here is a document that explains its features:
https://azure.microsoft.com/en-us/blog/introducing-azure-cool-storage/

Azure Data Factory Only Retrieve New Blob files from Blob Storage

I am currently copying blob files from Azure Blob Storage to an Azure SQL Database. The copy is scheduled to run every 15 minutes, but each time it runs it imports all of the blob files again. I would like to configure it so that it only imports files that have newly arrived in Blob Storage. One thing to note is that the files do not have a date/time stamp. All files are present in a single blob container, and new files are added to the same container. Do you know how to configure this?
I'd preface this answer by saying that a change in your approach may be warranted...
Given what you've described, you're fairly limited on options. One approach is to have your scheduled job maintain knowledge of what it has already stored in the SQL DB. You loop over all the items within the container and check whether each one has been processed yet.
The container has a ListBlobs method that would work for this. Reference: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/
// Flat-list every blob in the container (the second argument enables flat listing)
foreach (var item in container.ListBlobs(null, true))
{
    // Check if it has already been processed or not
}
Note that the number of blobs in the container may be an issue with this approach. If it is too large, consider creating a new container per hour/day/week/etc. to hold the blobs, assuming you can control this.
Please use CloudBlobContainer.ListBlobs(null, true, BlobListingDetails.Metadata) and check CloudBlob.Properties.LastModified for each listed blob.
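The same last-modified check sketched in Python with the current azure-storage-blob package might look like the snippet below. The watermark-file approach and all names are assumptions made for illustration, not part of the .NET API described above:

# Process only blobs modified since the last successful run, tracked via a
# simple watermark timestamp persisted between runs (here just a local file).
from datetime import datetime, timezone
from pathlib import Path
from azure.storage.blob import BlobServiceClient

WATERMARK = Path("last_run.txt")  # assumption: where the last-run time is kept

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("incoming")  # placeholder container name

last_run = datetime.fromisoformat(WATERMARK.read_text()) if WATERMARK.exists() \
    else datetime.min.replace(tzinfo=timezone.utc)

for blob in container.list_blobs():
    if blob.last_modified > last_run:
        # New since the previous run: download it and load it into Azure SQL here
        data = container.download_blob(blob.name).readall()

WATERMARK.write_text(datetime.now(timezone.utc).isoformat())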
Instead of a Copy activity, I would use a custom DotNet activity within Azure Data Factory and use the Blob Storage API (some of the answers here have described its use) and the Azure SQL API to copy only the new files.
However, with time your blob location will have a lot of files, so expect your job to take longer and longer (at some point longer than 15 minutes) as it iterates through every file each time.
Can you explain your scenario further? Is there a reason you want to add data to the SQL tables every 15 minutes? Could you increase that to copy data every hour? Also, how is this data getting into Blob Storage? Is another Azure service putting it there, or is it an external application? If it is another service, consider moving it straight into Azure SQL and cutting out Blob Storage.
Another suggestion would be to create folders for the 15-minute intervals, named hhmm. So, for example, a sample folder would be called '0515'. You could even have parent folders for the year, month and day. This way you can insert the data into these folders in Blob Storage. Data Factory is capable of reading date and time folders and identifying new files that arrive in them.
I hope this helps! If you can provide some more information about your problem, I'd be happy to help you further.
