Batch Copy/Delete some blobs in container - azure

I have many thousands of containers and each container has up to 10k blobs inside. I have a list of (container, blob) tuples that I need to:
- copy to another storage account
- delete later from the original storage account
The blobs in the containers are not related to each other: random creation dates, random names (GUIDs), nothing in common.
Q: Is there any efficient way to do these operations?
I already looked at az-cli and azcopy and haven't found a good way.
I tried, for example, calling azcopy repeatedly for each tuple, but that would take ages: one call to copy a blob took 2 seconds on average. It's nice that it starts the operation in the background, but if just starting the operation takes about 2 seconds, it's pretty useless for my case.

I'm assuming, based on the comments, that within each container it's an arbitrary number (and naming) of blobs to copy and delete, and that the delete applies only to the blobs that were copied (not the full container). If so, and you want to use something besides REST, one suggestion would be a PowerShell script that reads the list of blobs to copy from a file, does a service-side copy, and then separately does the delete (it's more efficient to copy first and, only if that succeeds, delete), e.g. https://learn.microsoft.com/en-us/powershell/module/az.storage/get-azstorageblobcopystate?view=azps-4.7.0#example-4--start-copy-and-pipeline-to-get-the-copy-status
Cheers, Klaas [Microsoft]
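If you would rather do the same thing from code than from PowerShell, here is a minimal sketch of that copy-then-delete loop using the Azure.Storage.Blobs .NET SDK. The connection strings, the blobs-to-move.csv file name and its one-"container,blob"-pair-per-line format are assumptions for illustration, and for a private source account the source URI would need to carry a SAS token for the service-side copy to be authorized.

using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class BatchBlobMover
{
    static async Task Main()
    {
        var sourceService = new BlobServiceClient("<source-connection-string>");
        var destService   = new BlobServiceClient("<destination-connection-string>");

        // One "container,blobName" pair per line (hypothetical input format).
        foreach (var line in File.ReadLines("blobs-to-move.csv"))
        {
            var parts         = line.Split(',');
            var sourceBlob    = sourceService.GetBlobContainerClient(parts[0]).GetBlobClient(parts[1]);
            var destContainer = destService.GetBlobContainerClient(parts[0]);
            await destContainer.CreateIfNotExistsAsync();
            var destBlob = destContainer.GetBlobClient(parts[1]);

            // Service-side (async) copy: the data moves inside Azure, not through this machine.
            // For a private source account, sourceBlob.Uri must carry a SAS token instead.
            var copy = await destBlob.StartCopyFromUriAsync(sourceBlob.Uri);
            await copy.WaitForCompletionAsync();

            // Delete the original only after the copy has completed.
            await sourceBlob.DeleteIfExistsAsync();
        }
    }
}

Because the copy is executed by the storage service, each iteration only issues a few small control-plane requests, so the loop can be parallelized (for example with Task.WhenAll over batches) instead of paying azcopy's per-invocation startup cost for every blob.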

Related

Parallel Copy using azcopy

I have been using azcopy regularly to copy models from Azure Blob Storage to an Azure VM. But when copying datasets to my VM, I use Azure File Share and move the data onto the data disk with the cp command. I want to use AzCopy to copy data in parallel. I believe I once heard that AzCopy copies data in parallel, but I can't find that statement anywhere; maybe I heard it wrong.
I also saw another question on Stack Overflow that talked about parallelism in azcopy. The answer linked to the azcopy documentation and mentioned --parallel-level, but when I followed the link, no such option exists as stated.
If anyone can point me to the azcopy parallelism documentation, if it exists, that would be really helpful.
AzCopy copies data in parallel by default, but you can change how many files are copied in parallel.
Throughput can decrease when transferring small files. You can increase throughput by setting the AZCOPY_CONCURRENCY_VALUE environment variable. This variable specifies the number of concurrent requests that can occur. If your computer has fewer than 5 CPUs, then the value of this variable is set to 32. Otherwise, the default value is equal to 16 multiplied by the number of CPUs. The maximum default value of this variable is 3000, but you can manually set this value higher or lower.
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-configure#optimize-throughput
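For example (a hypothetical invocation; the account, container and SAS placeholders are illustrative only), you could raise the concurrency before a job made up of many small blobs like this:

# Bump AzCopy's concurrent request count for this shell session.
# (On Windows use `set` in cmd or `$env:AZCOPY_CONCURRENCY_VALUE = "256"` in PowerShell.)
export AZCOPY_CONCURRENCY_VALUE=256
azcopy copy "https://<account>.blob.core.windows.net/<source-container>?<SAS>" \
            "https://<account>.blob.core.windows.net/<destination-container>?<SAS>" \
            --recursive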

Copy one file at a time from one container to another using azure data factory v2

I am trying to copy files from one container to another in a storage account. The scenario I implemented works fine for a single file, but for multiple files it copies both of them in a single copy activity. I want the files to be moved one at a time, with a delay of 1 minute after each copy before proceeding with the next file.
I created a pipeline with the Move Files template, but it did not work for multiple files.
I have taken the source and sink datasets as CSV datasets, not binary. I will not know the pattern or the names of the files in advance.
When a user inputs, say, about 10 files, I want to copy them one at a time and also provide a delay between each copy. This has to happen between containers in two storage accounts.
I have also tried the Move Files template, but it did not work for multiple files. Please help me.
Sanaa, to force sequential processing, check the "Sequential" checkbox on the ForEach activity's settings.
The time delay can be achieved by adding a "Wait" activity inside the loop, as sketched below.
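For reference, the ForEach part of the pipeline JSON ends up looking roughly like the sketch below. This is a hedged illustration rather than anything from the original thread: the fileList parameter and activity names are made up, and the Copy activity that would sit before the Wait inside the same activities array is omitted for brevity.

{
  "name": "ForEachFile",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": true,
    "items": {
      "value": "@pipeline().parameters.fileList",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "WaitOneMinute",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 60 }
      }
    ]
  }
}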

Preparing archive data for Stream Analytics Import

Before I had time to get an ingestion strategy & process setup, I started collecting data that will eventually go through a Stream Analytics job. Now I'm sitting on an Azure blob storage container with over 500,000 blobs in it (no folder organization), another with 300,000 and a few others with 10,000 - 90,000.
The production collection process now writes these blobs to different containers in the YYYY-MM-DD/HH format, but that's only great going forward. This archived data I have is critical to get into my system and I'd like to just modify the inputs a bit for the existing production ASA job so I can leverage the same logic in the query, functions and other dependencies.
I know ASA doesn't like batches of more than a few hundred / thousand, so I'm trying to figure a way to stage my data in order to work well under ASA. This would be a one time run...
One idea was to write a script that looked at every blob, looked at the timestamp within the blob and re-create the YYYY-MM-DD/HH folder setup, but in my experience, the ASA job will fail when the blob's lastModified time doesn't match the folders it's in...
Any suggestions how to tackle this?
EDIT: I failed to mention that (1) there are no folders in these containers; all blobs live at the root of the container, and (2) the LastModifiedTime on the blobs is no longer useful or meaningful. The reason for the latter is that these blobs were collected from multiple other containers and merged together using the Azure CLI copy-batch command.
Can you please try the below?
Do this processing in two different jobs: one for the folders with date partitioning (say partitionedJob), and another for the old blobs without any date partitioning (say RefillJob).
Since RefillJob has a fixed number of blobs, put a predicate on System.Timestamp to make sure it only processes old events (see the query sketch after these steps). Start this job with at least 6 SUs and run it until all the events have been processed. You can confirm this by looking at LastOutputProcessedTime, by looking at the input event count, or by inspecting your output source. After this check, stop the job; it is no longer needed.
Start partitionedJob with a timestamp greater than the RefillJob cutoff. This assumes the folders for those timestamps exist.
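To illustrate the predicate mentioned in the RefillJob step, the query could look something like the sketch below. This is only an illustration: the input/output aliases, the EventTime payload field used for TIMESTAMP BY, and the cutoff date are assumptions, and depending on the job's compatibility level the function is written System.Timestamp or System.Timestamp().

SELECT
    *
INTO
    [archiveOutput]
FROM
    [refillInput] TIMESTAMP BY EventTime
WHERE
    -- only let the old, pre-partitioning events through
    System.Timestamp < CAST('2020-01-01T00:00:00Z' AS datetime)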

Is it possible to read text files from Azure Blob storage from the end?

I have rather large blob files that I need to read and ingest only latest few rows of information from. Is there an API (C#) that would read the files from the end until I want to stop, so that my app ingests the minimum information possible?
You should already know that block blobs are designed for sequential access, while page blobs are designed for random access, and append blobs for append operations, which in your case is not what we are looking for.
I believe your solution would be to save your blobs as page blobs as opposed to the default block blobs. Once you have a page blob, you have nice methods like GetPageRangesAsync, which returns an IEnumerable of PageRange. The latter overrides the ToString() method to give you the string content of the page.
Respectfully, I disagree with the answer above. While it is true that page blobs are designed for random access, they are meant for a different purpose altogether.
I also agree that block blobs are designed for sequential access; however, nothing prevents you from reading a block blob's content from the middle. With support for range reads on block blobs, it is entirely possible to read partial contents of a block blob.
To give you an example, let's assume you have a 10 MB blob (blob size = 10485760 bytes) and you want to read it from the bottom. Assuming you want to read 1 MB at a time, you would call DownloadRangeToByteArray or DownloadRangeToStream (or their async variants) and request the range from 9437184 (the 9 MB marker) through 10485759 (the last byte). Read the contents and see if you find what you're looking for. If not, read the blob's contents from 8 MB to 9 MB and continue the process.
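Here is a rough sketch of that loop in C#, using the classic Microsoft.Azure.Storage.Blob (v11) client that exposes the methods named above. The connection string, the container and blob names, and the 1 MB chunk size are assumptions for illustration.

using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;

class BlobTailReader
{
    static async Task Main()
    {
        var account = CloudStorageAccount.Parse("<connection-string>");
        CloudBlockBlob blob = account.CreateCloudBlobClient()
            .GetContainerReference("logs")
            .GetBlockBlobReference("large-log.txt");

        await blob.FetchAttributesAsync();          // populates blob.Properties.Length
        const int chunkSize = 1024 * 1024;          // read 1 MB at a time, starting from the end
        long blobLength = blob.Properties.Length;
        long offset = Math.Max(0, blobLength - chunkSize);

        while (true)
        {
            int length = (int)Math.Min(chunkSize, blobLength - offset);
            var buffer = new byte[length];
            await blob.DownloadRangeToByteArrayAsync(buffer, 0, offset, length);

            // Note: a fixed byte range can cut a line (or a multi-byte character) in half,
            // so keep the leftover prefix and prepend it to the next chunk when scanning.
            string chunk = Encoding.UTF8.GetString(buffer);
            Console.WriteLine($"Read bytes {offset}-{offset + length - 1}");

            if (offset == 0) break;                 // or: break once the rows you need have been found
            offset = Math.Max(0, offset - chunkSize);
        }
    }
}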

Is there a way to do symbolic links to the blob data when using Azure Storage to avoid duplicate blobs?

I have a situation where a user is attaching files within an application, these files are then persisted to Azure Blob storage, there is a reasonable likelihood that there are going to be duplicates and I want to put in place a solution where duplicate blobs are avoided.
My first thought was to just name the blob as filename_hash, but that only captures a subset of duplicates, so filesize_hash was my next thought.
In doing this though it seems like I am losing some of the flexibility of the blob storage to represent the position in a hierarchy of the file, see: Windows Azure: How to create sub directory in a blob container
So I was looking to see whether there is a way to create a blob that references the blob data, i.e. some form of symbolic link, but I couldn't find what I wanted.
Am I missing something, or should I just go with the filesize_hash method and store my hierarchy using an alternative method?
No, there's no symbolic links (source: http://social.msdn.microsoft.com/Forums/vi-VN/windowsazuredata/thread/6e5fa93a-0d09-44a8-82cf-a3403a695922).
A good solution depends on the anticipated size of the files and the number of duplicates. If there aren't going to be many duplicates, or the files are small, then it may actually be quicker and cheaper to live with it - $0.15 per gigabyte per month is not a great deal to pay, compared to the development cost! (That's the approach we're taking.)
If it was worthwhile to remove duplicates I'd use table storage to create some kind of redirection between the file name and the actual location of the data. I'd then do a client-side redirect to redirect the client's browser to download the proper version.
If you do this you'll want to preserve the file name (as that will be what's visible to the user) but you can call the "folder" location what you want.
Another solution that keeps the full structure of your files but still provides a way to do "symbolic links" could be as follows, though as in the other answer the price might be so small that it's not worth the effort of implementing it.
In a similar setup I decided to just store the MD5 of each uploaded file in a table, and then in a year go back and see how many duplicates got uploaded and how much storage could have been saved. At that point it will be easy to evaluate whether it's worth implementing a solution for symbolic links.
The downside of maintaining it all in table storage is that you get a limited query API for your blobs. Instead, I would suggest using the metadata on blobs for creating links (metadata turns into normal headers on the requests when using the REST API, etc.).
So for duplicate blobs, just keep one of them and on each duplicate store a link header telling where the data is:
blob.Metadata.Add("link", dataBlob.Name);   // point the duplicate at the blob that holds the real data
await blob.SetMetadataAsync();              // persist the metadata
await blob.UploadTextAsync("");             // replace the duplicate's content with an empty body
At this point the blob takes up no data but is still present in storage and will be returned when listing blobs.
Then, when accessing the data, you simply check whether a blob has a "link" metadata entry (or, over REST, whether an x-ms-meta-link header is present) and, if so, read the data from the linked blob instead:
blob.Container.GetBlockBlobReference(blob.Metadata["link"]).DownloadTextAsync()
or use any of the other methods for accessing the data.
The above is just the basics, and I am sure you can figure out the rest if you take this approach.
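Putting the read path together, a small helper along these lines could hide the indirection. This is a sketch only, assuming the same classic Microsoft.Azure.Storage.Blob client as the snippets above and that the link target lives in the same container.

using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob;

static class LinkedBlobReader
{
    // Returns the text of the blob, following the "link" metadata if this blob is only a placeholder.
    public static async Task<string> DownloadTextFollowingLinkAsync(CloudBlockBlob blob)
    {
        await blob.FetchAttributesAsync();          // loads the blob's metadata

        if (blob.Metadata.TryGetValue("link", out var targetName))
        {
            // Placeholder blob: the real data lives in the blob the link points to.
            var target = blob.Container.GetBlockBlobReference(targetName);
            return await target.DownloadTextAsync();
        }

        return await blob.DownloadTextAsync();      // regular blob: just read it directly
    }
}

Usage would then be a single call such as: string text = await LinkedBlobReader.DownloadTextFollowingLinkAsync(blob);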
