Is an Azure blob available for download whilst it is being overwritten with a new version?
From my tests using Cloud Storage Studio, the download is blocked until the overwrite is completed; however, my tests are from the same machine, so I can't be sure this is correct.
If it isn't available during an overwrite, then I presume the solution (to maintain availability) would be to upload under a different blob name and then rename once complete. Does anyone have a better solution than this?
The blob is available during an overwrite. What you see will depend on whether you are using a block blob or a page blob, however. For block blobs, you will download the older version until the final block commit: that final PutBlockList operation atomically updates the blob to the new version. I am not actually sure what happens for a very large blob that you are in the middle of downloading when a PutBlockList atomically updates it. The choices are: (a) the request continues with the older blob, (b) the connection is broken, or (c) you start downloading bytes of the new blob. What a fun thing to test!
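To make the commit behaviour concrete, here is a minimal sketch of a block blob overwrite with the .NET storage client (the connection string, container, blob name and chunk-file paths are placeholders). Staged blocks are invisible to readers; only the final PutBlockList swaps the new version in.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlockBlobOverwrite
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        var blob = account.CreateCloudBlobClient()
                          .GetContainerReference("documents")
                          .GetBlockBlobReference("report.dat");

        var blockIds = new List<string>();
        foreach (string chunkPath in Directory.GetFiles(@"C:\chunks"))
        {
            // Block IDs must be base64 strings of equal length.
            string blockId = Convert.ToBase64String(
                Encoding.UTF8.GetBytes(blockIds.Count.ToString("d6")));
            using (FileStream chunk = File.OpenRead(chunkPath))
            {
                // Staged blocks do not affect readers; the existing blob is still served.
                blob.PutBlock(blockId, chunk, null);
            }
            blockIds.Add(blockId);
        }

        // Single atomic commit: readers see the old version right up to this call
        // and the new version immediately after it.
        blob.PutBlockList(blockIds);
    }
}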
If you are using page blobs (without a lease), you will read inconsistent data as the page ranges are updated underneath you. Each page-range update is atomic, but the overall result will look weird unless you lease the blob to keep other writers out and have readers work from snapshots (readers can snapshot a leased blob and read that stable state).
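If you go the lease-and-snapshot route, the pattern with the .NET storage client might look roughly like this (the account, container, blob name and 60-second lease duration are illustrative, and the actual page-range updates are elided):

using System;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class ConsistentPageBlobAccess
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        var blob = account.CreateCloudBlobClient()
                          .GetContainerReference("disks")
                          .GetPageBlobReference("data.vhd");

        // Writer: hold a lease so nothing else can modify the blob mid-update.
        string leaseId = blob.AcquireLease(TimeSpan.FromSeconds(60), null);
        try
        {
            // ... apply page-range updates, passing AccessCondition.GenerateLeaseCondition(leaseId) ...
        }
        finally
        {
            blob.ReleaseLease(AccessCondition.GenerateLeaseCondition(leaseId));
        }

        // Reader: snapshot the (possibly leased) blob and read the snapshot,
        // which is a stable point-in-time view rather than a moving target.
        CloudPageBlob snapshot = blob.CreateSnapshot();
        using (var file = File.OpenWrite("data-consistent.vhd"))
        {
            snapshot.DownloadToStream(file);
        }
    }
}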
I might try to test the block blob update in middle of read scenario to see what happens. However, your core question should be answered: the blob is available.
My question is: does AzCopy support moving and removing files from the local disk after copying them to Azure Storage?
From the local disk, no: AzCopy only copies to the cloud and does not take responsibility for any deletion or move. Best practice is to wrap your AzCopy call in a script that takes care of the local file handling based on a successful response from AzCopy.
You could roll your own version using the Storage Data Movement Library.
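For example, a rough sketch using the Data Movement Library that uploads a file and deletes the local copy only after the transfer completes successfully (the connection string, container and local path are placeholders; this is one way to get the "move" semantics AzCopy doesn't offer):

using System;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.DataMovement;

class UploadThenDelete
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        var container = account.CreateCloudBlobClient().GetContainerReference("documents");

        string localPath = @"C:\outgoing\report.pdf";   // placeholder path
        CloudBlockBlob destination = container.GetBlockBlobReference(Path.GetFileName(localPath));

        // UploadAsync throws on failure, so the delete only runs when the copy succeeded.
        TransferManager.UploadAsync(localPath, destination).GetAwaiter().GetResult();
        File.Delete(localPath);
    }
}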
I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes into ADLS. During execution I notice that a folder is created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and contains many files, also named with GUIDs. The folder is temporary and disappears after around 30 seconds.
Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? It appears in the Data Explorer in the portal, but does it show up through the API? I am concerned it may create issues down the line, even though it is a temporary structure.
Any insight appreciated; a Google search turned up little.
So what you're seeing here is something that Azure Data Lake Storage does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.
For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory, one per thread of the parallel operation. Once complete, the process concatenates those pieces into the single expected destination file.
For comparison, this is similar to what PolyBase does in SQL DW with its 8 external readers that hit a file in 512 MB blocks.
I understand your concerns here. I've also done battle with this, whereby the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path.
One other thing: I've had problems when using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly and caused corruption in my structured dataset. This was with files in the 4-8 GB region. Be warned!
Side note: I've found PowerShell the most reliable tool for handling uploads into Data Lake Store.
Hope this helps.
I have built an Azure WebJob that takes a queue trigger and a blob reference as input, processes the file, and creates multiple output blob files (it's breaking a PDF into individual pages). In order to output the multiple blobs, I have code in the job that explicitly creates the storage/container connection and writes the output. It would be cleaner to let WebJobs handle this if it's possible using attributes.
Is there a way to output multiple blobs to a container? I can output multiple queue messages using the QueueAttribute and an ICollector, but I don't see whether that's possible with a blob (for example, a container reference to which I can send multiple blobs). Thanks.
Correct - BlobAttribute does not support an ICollector binding. In the current beta release, we have added some new bindings that might help you a bit. For example, you can now bind to a CloudBlobContainer and you could use that to create additional blobs. See the release notes for more details.
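A rough sketch of what the container binding could look like (the queue, container and function names are illustrative, and PdfSplitter.SplitIntoPages is a hypothetical stand-in for your existing PDF-splitting code):

using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage.Blob;

public static class Functions
{
    public static async Task SplitPdf(
        [QueueTrigger("pdf-jobs")] string blobName,                   // queue message holds the source blob name
        [Blob("pdfs/{queueTrigger}", FileAccess.Read)] Stream input,  // source PDF
        [Blob("pages")] CloudBlobContainer outputContainer)           // container binding from the beta
    {
        int pageNumber = 0;
        foreach (Stream page in PdfSplitter.SplitIntoPages(input))    // hypothetical splitter
        {
            pageNumber++;
            CloudBlockBlob pageBlob = outputContainer.GetBlockBlobReference(
                string.Format("{0}-page-{1}.pdf", blobName, pageNumber));
            await pageBlob.UploadFromStreamAsync(page);
        }
    }
}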
Another possibility would be for you to use the IBinder binding (example here). It allows you to imperatively bind to a blob. You could do that multiple times in your function.
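Along the same lines, an IBinder version might look something like this (same illustrative names; one imperative bind per output page):

using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class Functions
{
    public static async Task SplitPdf(
        [QueueTrigger("pdf-jobs")] string blobName,
        [Blob("pdfs/{queueTrigger}", FileAccess.Read)] Stream input,
        IBinder binder)
    {
        int pageNumber = 0;
        foreach (Stream page in PdfSplitter.SplitIntoPages(input))    // hypothetical splitter
        {
            pageNumber++;
            var attribute = new BlobAttribute(
                string.Format("pages/{0}-page-{1}.pdf", blobName, pageNumber), FileAccess.Write);

            // Bind imperatively to one output blob per page.
            using (Stream output = await binder.BindAsync<Stream>(attribute))
            {
                await page.CopyToAsync(output);
            }
        }
    }
}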
I am trying to do an incremental copy of roughly 500,000 blobs from one storage account to another.
However, it seems that if I do not specify a /Pattern: parameter, AzCopy just hangs and never finishes (I actually stopped the process after about 15 minutes).
Is half a million (potentially up to 5 million) blobs too much for AzCopy to handle, or am I missing something here?
The command I'm using looks like this:
AzCopy /Source:<src>/documents /SourceKey:<srcKey> /Dest:<dest>/documents /DestKey:<destKey> /S /XO /Y
Adding the /Pattern parameter solves it, but I'd like a complete copy of all blobs in the container.
I should add that it already managed to copy all the blobs; it is the subsequent runs that fail, when it has to "figure out" which blobs have been added since the last full backup.
Which version of AzCopy are you using? I believe this issue was fixed many releases ago. Several versions back, AzCopy needed to list all the blobs to be transferred before starting the transfer; current versions of AzCopy are able to list and transfer simultaneously.
To download the latest version of AzCopy and find more information, please refer to http://aka.ms/azcopy.
I have a few questions concerning Azure. At the moment I have created a VHD image pre-installed with all my software so I can easily rebuild the same server. All this works perfectly, but the next thing I'm working on is backups.
There is a lot of material on the web concerning this, but none of it involves Linux (or I can't find it). From what I've read, there are a few options.
The first option is to create a snapshot and store it in blob storage. The next question is how? I installed the Azure CLI tools via npm, but how do I use them? I can find nothing on the web about using them for this on the command line.
The second option is to store a ZIP file as blob data, so I can manage the backups manually instead of taking a complete snapshot. I don't know whether this is better or worse, but the same question applies: how does it work?
I hope someone can point me in the right direction, because I am stuck at this point. As you might know, backups are essential for this to work, so without them I can't use Azure.
Thanks for your answer but I am still not able to do this.
root@DEBIAN:/backup# curl https://mystore01.blob.core.windows.net/backup/myblob?comp=snapshot
<?xml version="1.0" encoding="utf-8"?><Error><Code>UnsupportedHttpVerb</Code><Message>The resource doesn't support specified Http Verb.
RequestId:09d3323f-73ff-4f7a-9fa2-dc4e219faadf
Time:2013-11-02T11:59:08.9116736Z</Message></Error>root@DEBIAN:/backup# curl https://mystore01.blob.core.windows.net/backup/myblob?comp=snapshot -i
HTTP/1.1 405 The resource doesn't support specified Http Verb.
Allow: PUT
Content-Length: 237
Content-Type: application/xml
Server: Microsoft-HTTPAPI/2.0
x-ms-request-id: f9cad24e-4935-46e1-bcfe-a268b9c0107b
Date: Sat, 02 Nov 2013 11:59:18 GMT
<?xml version="1.0" encoding="utf-8"?><Error><Code>UnsupportedHttpVerb</Code><Message>The resource doesn't support specified Http Verb.
RequestId:f9cad24e-4935-46e1-bcfe-a268b9c0107b
Time:2013-11-02T11:59:19.8100533Z</Message></Error>root@HSTOP40-WEB01:/backup# ^C
I hope you can help me get this working, since the documentation on Azure + Linux is very poor.
I don't believe snapshots are implemented in the CLI. You can either work with the REST API for snapshotting directly, or use one of the language SDKs that wrap this functionality (such as the Node.js createBlobSnapshot()). Note that the 405 you're getting is because Snapshot Blob is a PUT operation (hence the Allow: PUT header) and it also needs an authenticated, signed request, which is awkward to construct by hand with curl.
Note that snapshots are point-in-time lists of committed blocks/pages. They're not actual bit-for-bit copies, yet they represent the exact contents of a blob at the moment you take the snapshot. You can then copy the snapshot to a new blob if you want and do anything you like with it (spin up a new VM, whatever). You can even do a blob copy to a storage account in a separate data center if you're looking at a DR strategy.
Snapshots initially take up very little space. If you start modifying the blocks or pages in a blob, the snapshot starts growing (as there need to be blocks/pages representing the original content). You can take unlimited snapshots, but you should consider purging them over time.
If you needed to restore your VM image to a particular point in time, you can copy any one of your snapshots to a new blob (or overwrite the original blob) and restart your VM based on the newly-copied vhd.
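For illustration, this is roughly what the snapshot-and-restore flow looks like with the .NET storage client; the Node.js createBlobSnapshot() and raw REST calls follow the same shape (the connection string, container and blob names below are placeholders):

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class VhdSnapshotBackup
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        var container = account.CreateCloudBlobClient().GetContainerReference("vhds");
        CloudPageBlob vhd = container.GetPageBlobReference("myvm.vhd");

        // Take a point-in-time snapshot; it stays cheap until pages start to diverge.
        CloudPageBlob snapshot = vhd.CreateSnapshot();
        Console.WriteLine("Snapshot taken: {0}", snapshot.SnapshotTime);

        // Restore later by copying a snapshot to a new blob (or over the original)
        // and recreating the VM from the copied VHD.
        CloudPageBlob restored = container.GetPageBlobReference("myvm-restored.vhd");
        restored.StartCopyFromBlob(snapshot);
    }
}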
You can store anything you want in a blob, including zip files. Not sure what the exact question is on that, but just create a zip and upload it to a blob.