How to update a file's Last Modified Date in Azure Blob storage

After uploading a file to Azure Blob storage, I can see that the original file timestamp is lost.
Somehow I need to preserve the original file timestamp. Azure Blob storage doesn't allow the "Last Modified Date" to be updated programmatically.
Please share if anyone has come across this situation before.

Since Last Modified Date is a system-defined property, you can't really preserve it. Any time a blob is updated, this value will change. If you're interested in finding out when a blob was originally created, one thing you could do is store the original date/time as a blob metadata entry. This, however, is not foolproof: if the blob is re-uploaded, this value would either change or be removed.
Another thing you could do is keep this information in a separate place (Table Storage for example).
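As a rough illustration of the metadata idea, the sketch below builds the metadata entry from a local file's modification time before upload and reads it back later. It uses only the Python standard library and deliberately leaves out the upload call itself. The key name original_last_modified and both function names are assumptions made up for this example, not part of any Azure SDK.

```python
import os
from datetime import datetime, timezone

# Hypothetical metadata key chosen for this sketch.
METADATA_KEY = "original_last_modified"


def original_timestamp_metadata(path):
    """Return a metadata dict holding the local file's modification time (UTC, ISO 8601)."""
    mtime = os.path.getmtime(path)
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
    return {METADATA_KEY: stamp}


def parse_original_timestamp(metadata):
    """Recover the stored timestamp, or None if the entry is missing."""
    value = metadata.get(METADATA_KEY)
    return datetime.fromisoformat(value) if value is not None else None
```

The returned dict would be passed as the blob's metadata at upload time. Because the entry can be edited or dropped on re-upload, the reader function returns None rather than failing when it is missing.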

Related

Storing file created date in Azure File Storage

I am storing a series of Excel files in an Azure File Storage container for my company. My manager wants to be able to see the file created date for these files, as we will be running monthly downloads of the same reports. Is there a way to automate a means of storing the created date as one of the properties in Azure, or adding a bit of custom metadata, perhaps? Thanks in advance.
You can certainly store the created date as part of custom metadata for the file. However, there are certain things you would need to be aware of:
Metadata is editable: Anybody with access to the storage account can edit the metadata. They can change the created date metadata value or even delete that information.
Querying is painful: Azure File Storage doesn't provide querying capability, so if you want to query on a file's created date, it is going to be a painful process. First you would need to list all files in a share and then fetch the metadata for each file separately. Depending on the number of files and the level of nesting, it could be a complicated process.
There are some alternatives available to you:
Use Blob Storage
If you can use Blob Storage instead of File Storage, use that. Blob Storage has a system-defined property for the created date, so you don't have to do anything special. However, like File Storage, Blob Storage also has an issue with querying, though it is comparatively less painful.
Use Table Storage/SQL Database For Reporting
For querying purposes, you can store the file's created date in either Azure Table Storage or SQL Database. The downside of this approach is that, because it is a completely separate system, it would be your responsibility to keep the data in sync. For example, if a file is deleted, you will need to ensure that the corresponding entry in the database is also removed.

How to archive Azure blob storage content?

I need to store some temporary files for maybe 1 to 3 months. I only need to keep the last three months of files; older files need to be deleted. How can I do this in Azure Blob storage? Is there any option for this other than Blob storage?
IMHO the best options for storing files in Azure are Blob Storage and File Storage; however, neither of them supports auto-expiration of content (based on age or some other criteria).
This feature was requested for Blob Storage a long time ago, but unfortunately no progress has been made so far (https://feedback.azure.com/forums/217298-storage/suggestions/7010724-support-expiration-auto-deletion-of-blobs).
You could, however, write something of your own to achieve this. It's quite simple: periodically (say, once a day) your program fetches the list of blobs and compares the last modified date of each blob with the current date. If the blob was last modified longer ago than the desired retention period (1 or 3 months, as you mentioned), you simply delete the blob.
You can use WebJobs, Azure Functions or Azure Automation to schedule your code to run on a periodic basis. In fact, there's ready-made code available if you want to use the Azure Automation service: https://gallery.technet.microsoft.com/scriptcenter/Remove-Storage-Blobs-that-aae4b761.
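The age check at the heart of such a cleanup job might look like the following sketch. It covers only the selection step and assumes the blob listing has already been fetched as (name, last_modified) pairs with timezone-aware datetimes. The function name select_expired and that input shape are assumptions for illustration; the actual listing and delete calls depend on the SDK you use.

```python
from datetime import datetime, timedelta, timezone


def select_expired(blobs, now, max_age):
    """Return the names of blobs last modified before (now - max_age).

    blobs: iterable of (name, last_modified) pairs, timezone-aware datetimes.
    """
    cutoff = now - max_age
    return [name for name, last_modified in blobs if last_modified < cutoff]
```

A scheduled job would call this with now=datetime.now(timezone.utc) and, for example, max_age=timedelta(days=90), then delete each returned blob. Passing the current time in as a parameter keeps the function easy to test.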
As far as I know, Azure Blob storage is an appropriate place to store temporary files. For your scenario, I assume there is no built-in option for deleting old files, so you need to delete your temporary files programmatically or manually.
As a simple approach, you could upload your blobs (files) using a specific naming format (e.g. https://<your-storagename>.blob.core.windows.net/containerName/2016-11/fileName or https://<your-storagename>.blob.core.windows.net/2016-11/fileName), and then manage your files manually via Microsoft Azure Storage Explorer.
Also, you could check your files and delete the old ones before uploading a new temporary file. For more details, you could follow storage-blob-dotnet-store-temp-files and override the method CleanStorageIfReachLimit to implement your logic for deleting blobs (files).
Additionally, you could use a scheduled Azure WebJob to clean up your blobs (files).
You can use Azure Cool Blob Storage.
Its storage cost is lower than that of the Hot tier of Blob storage, and it is more suitable for archives.
You can store your less frequently accessed data in the Cool access tier at a low storage cost (as low as $0.01 per GB in some regions), and your more frequently accessed data in the Hot access tier at a lower access cost.
Here is a document that explains its features:
https://azure.microsoft.com/en-us/blog/introducing-azure-cool-storage/

Fast mechanism for querying Azure blob names

I'm trying to get a list of blob names in Azure and I'm looking for ways to make this operation significantly faster. Within a given sub-folder, the number of blobs can exceed 150,000 elements. The filenames of the blobs are an encoded ID which is what I really need to get at, but I could store that as some sort of metadata if there was a way to query just the metadata or a single field of the metadata.
I'm finding that something as simple as the following:
var blobList = container.ListBlobs(null, false);
can take upwards of 60 seconds to run from my desktop and typically around 15 seconds when running on a VM hosted in Azure. These times are based on a test of 125k blobs in an otherwise empty container and were several hours after they were uploaded, so they've definitely had time to "settle", so to speak.
I've attempted multiple variations and tried using ListBlobsSegmented but it doesn't really help because the function is returning a lot of extra information that I simply don't need. I just need the blob names so I can get at the encoded ID to see what's currently stored and what isn't.
The query for the blob names and the extraction of the encoded ID are somewhat time sensitive, so if I could get it to under 1 second, I'd be happy with it. If I stored the files locally, I could get the entire list of files in a few ms, but I have to use Azure storage for this, so that's not an option.
The only thing I can think of to be able to reduce the time it takes to identify the available blobs is to track the names of the blobs being added or removed from a given folder and store it in a separate blob. Then when I need to know the blob names in that folder, I would read the blob with the metadata rather than using ListBlobs. I suppose another would be to use Azure Table storage in a similar way, but it seems like I'm being forced into caching information about a given folder in the container.
Is there a better way of doing this or is this generally what people end up doing when you have hundreds of thousands of blobs in a single folder?
As mentioned, Azure Blob storage is a storage system and doesn't help you with indexing the content. We now have the Azure Search indexer, which indexes the content uploaded to Azure Blob storage; see https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/. With this you can use all the features supported by Azure Search, e.g. listing, searching, paging, sorting, etc. Hope this helps.

Azure Data Factory Only Retrieve New Blob files from Blob Storage

I am currently copying blob files from Azure Blob storage to an Azure SQL Database. The copy is scheduled to run every 15 minutes, but each time it runs it imports all blob files again. I would rather configure it so that it only imports new files that have arrived in Blob storage. One thing to note is that the files do not have a date/time stamp. All files are present in a single blob container, and new files are added to the same container. Do you know how to configure this?
I'd preface this answer by saying that a change in your approach may be warranted.
Given what you've described, you're fairly limited in options. One approach is to have your scheduled job keep track of what it has already stored in the SQL database. You loop over all the items in the container and check whether each one has been processed yet.
The container has a ListBlobs method that would work for this. Reference: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/
foreach (var item in container.ListBlobs(null, true))
{
    // Check if it has already been processed or not
}
Note that the number of blobs in the container may be an issue with this approach. If it is too large, consider creating a new container per hour/day/week/etc. to hold the blobs, assuming you can control this.
Please use CloudBlobContainer.ListBlobs(null, true, BlobListingDetails.Metadata) and check CloudBlob.Properties.LastModified for each listed blob.
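One way to act on the LastModified check is to keep a "watermark" (the newest LastModified value seen so far) between runs and pick up only blobs newer than it. The sketch below shows that selection logic on its own, assuming the listing is available as (name, last_modified) pairs. The function name select_new_blobs and the idea of persisting the watermark somewhere, such as a row in the SQL database, are assumptions for illustration.

```python
from datetime import datetime, timezone


def select_new_blobs(blobs, last_watermark):
    """Return (new_names, new_watermark).

    blobs: iterable of (name, last_modified) pairs, timezone-aware datetimes.
    last_watermark: newest last_modified seen on the previous run, or None on the first run.
    """
    new_names = []
    watermark = last_watermark
    for name, last_modified in blobs:
        if last_watermark is None or last_modified > last_watermark:
            new_names.append(name)
            # Track the newest timestamp seen so it can be stored for the next run.
            if watermark is None or last_modified > watermark:
                watermark = last_modified
    return new_names, watermark
```

Note that a blob that is overwritten gets a new LastModified value, so under this scheme it would be imported again. Whether that is desirable depends on the scenario.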
Instead of a copy activity, I would use a custom DotNet activity within Azure Data Factory and use the Blob Storage API (some of the answers here have described the use of this API) and Azure SQL API to perform your copy of only the new files.
However, over time your blob location will accumulate a lot of files, so expect your job to take longer and longer (eventually longer than 15 minutes), as it iterates through every file on each run.
Can you explain your scenario further? Is there a reason you want to add data to the SQL tables every 15 minutes? Can you increase that to copy data every hour? Also, how is this data getting into Blob Storage? Is another Azure service putting it there or is it an external application? If it is another service, consider moving it straight into Azure SQL and cut out the Blob Storage.
Another suggestion would be to create folders for the 15 minute intervals like hhmm. So, for example, a sample folder would be called '0515'. You could even have a parent folder for the year, month and day. This way you can insert the data into these folders in Blob Storage. Data Factory is capable of reading date and time folders and identifying new files that come into the date/time folders.
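To make the folder naming concrete, here is a small sketch that maps a timestamp to a year/month/day/hhmm folder path for the 15-minute interval it falls in. The exact layout (yyyy/MM/dd/HHmm) and the function name interval_folder are assumptions for this example. The interval length is assumed to divide 60 evenly.

```python
from datetime import datetime


def interval_folder(moment, minutes=15):
    """Return a 'yyyy/MM/dd/HHmm' folder path for the interval containing moment.

    Assumes minutes divides 60 evenly (e.g. 5, 15, 30).
    """
    floored_minute = moment.minute - (moment.minute % minutes)
    start = moment.replace(minute=floored_minute, second=0, microsecond=0)
    return start.strftime("%Y/%m/%d/%H%M")
```

The uploading application would write each file under the returned prefix, so a job that runs every 15 minutes only has to read the folder for the interval that just ended.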
I hope this helps! If you can provide some more information about your problem, I'd be happy to help you further.

Insert with AZCopy to Azure tables with original time stamp

I've successfully copied an Azure Storage Table from Azure to the local Emulator using AzCopy. However, when looking at the local table, there are two columns named "Timestamp" and "TIMESTAMP". The latter contains the original timestamp, while the former is the timestamp of when the row was inserted.
I can't figure out whether it's possible to keep the original timestamp with AzCopy or not. The "Timestamp" column I get is quite useless.
I assume you took the following two steps to copy an Azure Storage Table to the local Emulator with AzCopy:
Exporting the Azure Storage Table to local files or blobs;
Importing from the local files or blobs to local Emulator table.
Please correct me if my assumption is wrong.
About the column "TIMESTAMP": does your original Azure Storage Table contain this column? If not, this may be unexpected behavior, since AzCopy shouldn't introduce any additional columns ("TIMESTAMP" here) when exporting and importing. If this is the case, please share more information with us so that we can verify whether it's a bug in AzCopy.
Regarding your question of whether it's possible to keep the original timestamp with AzCopy: the answer is no. Timestamp is a property maintained by the Azure Storage Table service, and users are not able to customize its value.

Resources