Write text to a file in Azure storage

I have a text file uploaded in my Azure storage account. Now, in my worker role, what I need to do is: every time it runs, it fetches some content from the database, and that content must be written to the uploaded text file. Specifically, each time the content of the text file should be overwritten with new content.
Here, they have given a way to upload a text file to your storage and also to delete a file. But I don't want to do that; I just need to MODIFY the already-present text file each time.

I'm assuming you're referring to storing a file in a Windows Azure blob. If that's the case: A blob isn't a file system; it's just a place to store data (and the notion of a file is a bit artificial - it's just... a blob stored in a bunch of blocks).
To modify the file, you would need to download it and save it to local disk, modify it (again, on local disk), then do an upload. A few thoughts on this:
For this purpose, you should allocate a local disk within your worker role's configuration. This disk will be a logical disk, created on a local physical disk within the machine your VM is running on. In other words, it'll be attached storage, perfect for this type of use.
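For example (a minimal sketch, assuming a local storage resource named "ScratchSpace" is declared in the worker role's ServiceDefinition.csdef; the name and size are arbitrary):
// Sketch: assumes <LocalStorage name="ScratchSpace" sizeInMB="1024" cleanOnRoleRecycle="false" />
// under <LocalResources> in ServiceDefinition.csdef (Microsoft.WindowsAzure.ServiceRuntime).
var scratch = RoleEnvironment.GetLocalResource("ScratchSpace");
var localPath = System.IO.Path.Combine(scratch.RootPath, "myfile.txt");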
The bandwidth between your VM instance and storage is about 100 Mbps per core, so grabbing a 10 MB file on a Small instance would take maybe a second; on an XL, maybe a tenth of a second. Either way it's fast, and the exact throughput varies with VM series (A, D, G) and size.
Because your file is in blob storage, if you felt so inclined (or had the need), you could take a snapshot prior to uploading an updated version. Snapshots are like linked lists to your stored data blocks, and there's no cost to a snapshot until, one day, you make a change to existing data (and now you'd have blocks representing both old and new data). It's an excellent way to preserve versions of a blob on a blob-by-blob basis (and it's trivial to delete snapshots).
Just to make sure this download/modify/upload pattern is clear, here's a very simple example (I just typed this up quickly in Visual Studio but haven't tested it. Just trying to illustrate the point):
// initial setup
var acct = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));
var client = acct.CreateCloudBlobClient();

// what you'd call each time you need to update a file stored in a blob
var blob = client.GetContainerReference("mycontainer").GetBlockBlobReference("myfile.txt");

// download the blob's current contents to local disk
using (var fileStream = System.IO.File.OpenWrite(@"path\myfile.txt"))
{
    blob.DownloadToStream(fileStream);
}

// ... modify the local file ...

// upload the modified file, overwriting the blob
using (var fileStream = System.IO.File.OpenRead(@"path\myfile.txt"))
{
    blob.UploadFromStream(fileStream);
}
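And if you wanted the snapshot safety net mentioned above, one extra call just before the upload would do it (a sketch, using the same blob reference; CreateSnapshot is the synchronous call from the classic storage client):
// optional: preserve the current version of the blob before overwriting it
var snapshot = blob.CreateSnapshot();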

Related

Batch Copy/Delete some blobs in container

I have many thousands of containers and each container has up to 10k blobs inside. I have a list of (container, blob) tuples that I need to
copy to another storage account, and
delete later from the original storage account.
The blobs in containers are not related to each other - random date creation, random names (guids), nothing in common.
Q: is there any efficient way to do these operations?
I already looked at az-cli and azcopy and haven't found a good way.
I tried, for example, calling azcopy repeatedly for each tuple, but this would take ages: one call to copy a blob took about 2 seconds on average. It's nice that it starts the operation in the background, but if just starting the operation takes about 2 seconds, it's pretty useless for my case.
I'm assuming, based on the comments, that within each container it's an arbitrary number (and naming) of blobs to copy and delete, and that the delete is only for the blobs copied (not the full container). If so, and if you want to use something besides REST, one suggestion would be a PowerShell script that reads the list of blobs to copy from a file, performs a service-side copy, and then separately does the delete (it's more efficient to do the copy and, if successful, then delete), e.g. https://learn.microsoft.com/en-us/powershell/module/az.storage/get-azstorageblobcopystate?view=azps-4.7.0#example-4--start-copy-and-pipeline-to-get-the-copy-status
Cheers, Klaas [Microsoft]
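As a rough C# illustration of that same copy-then-delete pattern with the classic storage SDK (a sketch only; sourceClient, destClient and tuples are placeholder names, and a cross-account copy generally needs the source to be readable by the destination service, e.g. via a SAS):
foreach (var (containerName, blobName) in tuples)
{
    var source = sourceClient.GetContainerReference(containerName).GetBlockBlobReference(blobName);
    var dest = destClient.GetContainerReference(containerName).GetBlockBlobReference(blobName);

    // server-side copy: no data flows through the machine running this code
    await dest.StartCopyAsync(source);

    // poll until the copy completes
    do
    {
        await Task.Delay(500);
        await dest.FetchAttributesAsync();
    } while (dest.CopyState.Status == CopyStatus.Pending);

    // only remove the original once the copy has succeeded
    if (dest.CopyState.Status == CopyStatus.Success)
    {
        await source.DeleteIfExistsAsync();
    }
}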

How is the data chunked when using UploadFromStreamAsync and DownloadToStreamAsync when uploading to block blob

I just started learning about Azure blob storage. I have come across various ways to upload and download data, and one thing that puzzles me is when to use what.
I am mainly interested in PutBlockAsync in conjunction with PutBlockListAsync and UploadFromStreamAsync.
As far as I understand, when using PutBlockAsync it is up to the user to break the data into chunks and make sure each chunk is within the Azure block blob size limits. There is an id associated with each chunk that is uploaded, and at the end all the ids are committed.
When using UploadFromStreamAsync, how does this work? Who handles chunking the data and uploading it?
Why not convert the data into a Stream and use UploadFromStreamAsync all the time, avoiding the separate block-then-commit step?
You can use Fiddler and observe what happens when you use UploadFromStreamAsync.
If the file is large (more than 256 MB), such as 500 MB, the Put Block and Put Block List APIs are called in the background (they are also the ones called when you use the PutBlockAsync and PutBlockListAsync methods).
If the file is smaller than 256 MB, then UploadFromStreamAsync calls the Put Blob API in the background.
I used UploadFromStreamAsync to upload a 600 MB file and watched the traffic in Fiddler.
Here are some findings from Fiddler:
1. The large file is broken into small (4 MB) chunks, which are uploaded one by one via the Put Block API in the background.
2. At the end, the Put Block List API is called.
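To make the contrast concrete, here is a rough sketch (classic WindowsAzure.Storage client; the container reference and file names are placeholders) of letting UploadFromStreamAsync do the chunking versus chunking the data yourself with PutBlockAsync/PutBlockListAsync:
// 1) Let the library chunk the data:
var blob = container.GetBlockBlobReference("bigfile.bin");
blob.StreamWriteSizeInBytes = 4 * 1024 * 1024;            // block size the library uses
var options = new BlobRequestOptions
{
    // files above this threshold go through Put Block + Put Block List,
    // smaller ones through a single Put Blob
    SingleBlobUploadThresholdInBytes = 32 * 1024 * 1024
};
using (var stream = System.IO.File.OpenRead(@"path\bigfile.bin"))
{
    await blob.UploadFromStreamAsync(stream, null, options, null);
}

// 2) Chunk the data yourself:
var blockIds = new List<string>();
using (var stream = System.IO.File.OpenRead(@"path\bigfile.bin"))
{
    var buffer = new byte[4 * 1024 * 1024];
    int read, blockNumber = 0;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // block IDs must be base64 strings of equal length
        var blockId = Convert.ToBase64String(BitConverter.GetBytes(blockNumber++));
        blockIds.Add(blockId);
        using (var chunk = new MemoryStream(buffer, 0, read))
        {
            await blob.PutBlockAsync(blockId, chunk, null);   // upload one block
        }
    }
}
await blob.PutBlockListAsync(blockIds);                       // commit the uploaded blocks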

Azure Import/Export tool dataset.csv and multiple session folders

I am in the process of copying a large set of data to an Azure Blob Storage area. My source set of data has a large number of files that I do not want to move, so my first thought was to create a DataSet.csv file of just the files I do want to copy. As a test, I created a csv file where each row is a single file that I want to include.
BasePath,DstBlobPathOrPrefix,BlobType,Disposition,MetadataFile,PropertiesFile
"\\SERVER\Share\Folder1\Item1\Page1\full.jpg","containername/Src/Folder1/Item1/Page1/full.jpg",BlockBlob,overwrite,"None",None
"\\SERVER\Share\Folder1\Item1\Page1\thumb.jpg","containername/Src/Folder1/Item1/Page1/thumb.jpg",BlockBlob,overwrite,"None",None
etc.
When I run the Import/Export tool (WAImportExport.exe) it seems to create a single folder on the destination for each file, so that it ends up looking like:
session#1
-session#1-0
-session#1-1
-session#1-2
etc.
All files share the same base path, though each row does spell out its own filename in the CSV. Is there any way to avoid this, so that all the files go into a single "session#1" folder? If possible, I'd like to avoid creating N-thousand folders on the destination drive.
I don't think you should worry about the way the files are stored on the disk, as they will be converted back to the directory structure you specified in the .csv file.
Here's what the documentation says:
How does the WAImportExport tool work on multiple source dirs and disks?
If the data size is greater than the disk size, the WAImportExport tool will distribute the data across the disks in an optimized way. The data copy to multiple disks can be done in parallel or sequentially. There is no limit on the number of disks the data can be written to simultaneously. The tool will distribute data based on disk size and folder size. It will select the disk that is most optimized for the object size. The data, when uploaded to the storage account, will be converged back to the specified directory structure.

Is it possible to read text files from Azure Blob storage from the end?

I have rather large blob files that I need to read and ingest only latest few rows of information from. Is there an API (C#) that would read the files from the end until I want to stop, so that my app ingests the minimum information possible?
You should already know that block blobs are designed for sequential access, while page blobs are designed for random access, and append blobs for append operations, which in your case is not what we are looking for.
I believe your solution would be to save your blobs as a PageBlob as opposed to the default BlockBlob. Once you have a page blob, you have nice methods like GetPageRangesAsync, which returns an IEnumerable of PageRange. The latter overrides the ToString() method to give you the string content of the page.
Respectfully, I disagree with the answer above. While it is true that page blobs are designed for random access, they are meant for a different purpose altogether.
I also agree that block blobs are designed for sequential access; however, nothing prevents you from reading a block blob's content from the middle. With support for range reads on block blobs, it is entirely possible for you to read partial contents of a block blob.
To give you an example, let's assume you have a 10 MB blob (blob size = 10485760 bytes). Now you want to read the blob from the bottom. Assuming you want to read 1MB chunk at a time, you would call DownloadRangeToByteArray or DownloadRangeToStream (or their Async variants) and specify 9437184 (9MB marker) as starting range and 10485759 (10MB marker) as ending range. Read the contents and see if you find what you're looking for. If not, you can read blob's contents from 8MB to 9MB and continue with the process.
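Here's a rough sketch of that loop (classic storage client; the container reference, blob name, 1 MB chunk size and UTF-8 assumption are just placeholders):
var blob = container.GetBlockBlobReference("big-log.txt");
await blob.FetchAttributesAsync();                       // populates blob.Properties.Length
const int chunkSize = 1024 * 1024;                       // 1 MB per range read
long position = blob.Properties.Length;

while (position > 0)
{
    long start = Math.Max(0, position - chunkSize);
    int length = (int)(position - start);
    var buffer = new byte[length];
    await blob.DownloadRangeToByteArrayAsync(buffer, 0, start, length);

    // note: a chunk boundary can split a line (or a multi-byte character)
    string text = System.Text.Encoding.UTF8.GetString(buffer);
    // ... scan 'text' for the rows you need and break out once you have them ...

    position = start;
}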

Is there a way to do symbolic links to the blob data when using Azure Storage to avoid duplicate blobs?

I have a situation where a user is attaching files within an application, these files are then persisted to Azure Blob storage, there is a reasonable likelihood that there are going to be duplicates and I want to put in place a solution where duplicate blobs are avoided.
My first thought was to just name the blob as filename_hash, but that only captures a subset of duplicates; filesize_hash was the next thought.
In doing this though it seems like I am losing some of the flexibility of the blob storage to represent the position in a hierarchy of the file, see: Windows Azure: How to create sub directory in a blob container
So I was looking to see if there was a way to create a blob that references the blob data, i.e. some form of symbolic link, but couldn't find what I wanted.
Am I missing something, or should I just go with the filesize_hash method and store my hierarchy using an alternative method?
No, there's no symbolic links (source: http://social.msdn.microsoft.com/Forums/vi-VN/windowsazuredata/thread/6e5fa93a-0d09-44a8-82cf-a3403a695922).
A good solution depends on the anticipated size of the files and the number of duplicates. If there aren't going to be many duplicates, or the files are small, then it may actually be quicker and cheaper to live with it - $0.15 per gigabyte per month is not a great deal to pay, compared to the development cost! (That's the approach we're taking.)
If it were worthwhile to remove duplicates, I'd use table storage to create some kind of redirection between the file name and the actual location of the data. I'd then issue a client-side redirect to send the client's browser to download the proper version.
If you do this you'll want to preserve the file name (as that will be what's visible to the user) but you can call the "folder" location what you want.
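As a loose illustration of that table-storage redirection (the BlobLinkEntity name and the key choices are just placeholders, not part of the answer):
// one row per stored file: the content hash identifies the deduplicated data,
// the row key keeps the user-visible file name
public class BlobLinkEntity : Microsoft.WindowsAzure.Storage.Table.TableEntity
{
    public BlobLinkEntity() { }

    public BlobLinkEntity(string contentHash, string fileName)
    {
        PartitionKey = contentHash;   // e.g. MD5/SHA-256 of the file contents
        RowKey = fileName;            // what the user sees
        // (row keys can't contain '/', '\', '#' or '?', so real paths would need encoding)
    }

    public string BlobUri { get; set; }   // where the single stored copy actually lives
}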
Another solution that keeps the full structure of your files but still provides a way to do "symbolic links" could be as follows, though as in the other answer the price might be so small that it's not worth the effort of implementing it.
In a similar setup I decided to just store the MD5 of each uploaded file in a table and then, in a year, go back and see how many duplicates got uploaded and how much storage could be saved. At that point it will be easy to evaluate whether it's worth implementing a solution for symbolic links.
The downside of maintaining it all in table storage is that you get a limited query API to your blobs. Instead, I would suggest using metadata on the blobs to create the links (metadata turns into normal x-ms-meta headers on the requests when using the REST API, etc.).
So for duplicate blobs, just keep one of them and store a link header telling where the data is.
await blob.UploadTextAsync("");            // create the empty placeholder blob first
blob.Metadata.Add("link", dataBlob.Name);  // record where the real data lives
await blob.SetMetadataAsync();
At this point the blob takes up no data but is still present in storage and will be returned when listing blobs.
Then, when accessing data, you simply check whether a blob has a "link" metadata entry set (or, with REST, whether an x-ms-meta-link header is present) and, if so, read the data from there instead.
blob.Container.GetBlockBlobReference(blob.Metadata["link"]).DownloadTextAsync()
or any of the other methods for accessing the data.
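A small sketch of that read path (just an illustration of the check described above):
await blob.FetchAttributesAsync();                         // loads the blob's metadata
string text;
if (blob.Metadata.TryGetValue("link", out var target))
{
    // placeholder blob: follow the link to the blob that holds the real data
    text = await blob.Container.GetBlockBlobReference(target).DownloadTextAsync();
}
else
{
    text = await blob.DownloadTextAsync();
}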
Above is just the basics and I am sure you can figure out the rest if this is used.
